Decision tree and RapidMiner performance measures - how to understand them
I would like to ask for help in the following matter.
In a decision tree created with gain ratio I just receive the classification of every instance to some class. In my case, one of 2 classes.
I do not understand how the RMSE is calculated if this measure is based on the difference between the actual value and the predicted value. If my classes use the index symbols 0 and 1, does it mean that the difference between the actual value and the predicted value is always 0 or 1?
Similarly, I do not understand the margin definition. The margin is defined as the minimal confidence for the correct label. Should I calculate the confidence for all the nodes and take the minimum value?
Finally, I do not understand the soft margin. Soft margin loss is the average soft margin loss on a classifier, defined as the average of all 1 - confidences for the correct label. How do I calculate 1 - confidence for the correct label?
Best Answer
yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Hi @Picia,
Thanks for the follow-up and clarifications. Yes, gain ratio is the right criterion to use for classification trees.
If you are interested in how RMSE, margin, and soft margin loss are calculated for classification performance, here are the open-source Java source files behind them:
https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/performance/RootMeanSquaredError.java
https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/performance/Margin.java
https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/performance/SoftMarginLoss.java
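Going by the definitions quoted in the question (margin as the minimal confidence for the correct label, soft margin loss as the average of 1 - confidence for the correct label), a minimal Python sketch could look like this. The function names and the sample confidences are made up for illustration, and this is not a transcription of the linked Java sources.

```python
def margin(correct_confidences):
    """Smallest confidence assigned to the true label over all examples."""
    return min(correct_confidences)


def soft_margin_loss(correct_confidences):
    """Average of (1 - confidence for the true label) over all examples."""
    losses = [1.0 - c for c in correct_confidences]
    return sum(losses) / len(losses)


# Hypothetical data: for each example, the confidence the model gave
# to that example's *actual* label (one value per example, not per node).
confidences = [0.9, 0.6, 0.8, 0.3]

print("margin:", margin(confidences))
print("soft margin loss:", soft_margin_loss(confidences))
```

Note that both measures work on one confidence value per example, so nothing has to be computed per tree node.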
Attached is an example process that manually calculates RMSE for the training performance.
Simply put, the squared error (SE), i.e. the squared "gap" between the real value (yes or no) and the prediction confidence for that value, is computed in the "Generate Attributes" operator for each instance.
We then get the MSE (mean squared error) from the SE by extracting its average statistic.
Finally, RMSE is the square root of MSE.
RMSE = sqrt(MSE), where MSE = sum of squared errors / N and N is the number of examples.
<?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.5.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.4.000" expanded="true" height="103" name="Decision Tree" origin="GENERATED_TUTORIAL" width="90" x="447" y="85">
        <parameter key="criterion" value="gain_ratio"/>
        <parameter key="maximal_depth" value="20"/>
        <parameter key="apply_pruning" value="true"/>
        <parameter key="confidence" value="0.25"/>
        <parameter key="apply_prepruning" value="true"/>
        <parameter key="minimal_gain" value="0.1"/>
        <parameter key="minimal_leaf_size" value="2"/>
        <parameter key="minimal_size_for_split" value="4"/>
        <parameter key="number_of_prepruning_alternatives" value="3"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.5.001" expanded="true" height="82" name="Apply Model" width="90" x="648" y="85">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="9.5.001" expanded="true" height="82" name="Performance" width="90" x="849" y="34">
        <parameter key="main_criterion" value="first"/>
        <parameter key="accuracy" value="true"/>
        <parameter key="classification_error" value="false"/>
        <parameter key="kappa" value="false"/>
        <parameter key="weighted_mean_recall" value="false"/>
        <parameter key="weighted_mean_precision" value="false"/>
        <parameter key="spearman_rho" value="false"/>
        <parameter key="kendall_tau" value="false"/>
        <parameter key="absolute_error" value="false"/>
        <parameter key="relative_error" value="false"/>
        <parameter key="relative_error_lenient" value="false"/>
        <parameter key="relative_error_strict" value="false"/>
        <parameter key="normalized_absolute_error" value="false"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="root_relative_squared_error" value="false"/>
        <parameter key="squared_error" value="false"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="cross-entropy" value="true"/>
        <parameter key="margin" value="true"/>
        <parameter key="soft_margin_loss" value="true"/>
        <parameter key="logistic_loss" value="true"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" breakpoints="after" class="generate_attributes" compatibility="9.5.001" expanded="true" height="82" name="Generate Attributes" width="90" x="983" y="136">
        <list key="function_descriptions">
          <parameter key="SE" value="if(Survived==&quot;Yes&quot;,(1-[confidence(Yes)])^2,(1-[confidence(No)])^2)"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" breakpoints="after" class="extract_macro" compatibility="9.5.001" expanded="true" height="68" name="Extract Macro" width="90" x="1184" y="136">
        <parameter key="macro" value="MSE"/>
        <parameter key="macro_type" value="statistics"/>
        <parameter key="statistics" value="average"/>
        <parameter key="attribute_name" value="SE"/>
        <list key="additional_macros"/>
        <description align="center" color="transparent" colored="false" width="126">MSE is the sum of squared errors / n</description>
      </operator>
      <operator activated="true" class="generate_macro" compatibility="9.5.001" expanded="true" height="82" name="Generate Macro (2)" width="90" x="1318" y="136">
        <list key="function_descriptions">
          <parameter key="RMSE" value="sqrt(eval(%{MSE}))"/>
        </list>
        <description align="center" color="transparent" colored="false" width="126">Calculate RMSE based on the square root of SSE and the number of examples</description>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Decision Tree" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro (2)" to_port="through 1"/>
      <connect from_op="Generate Macro (2)" from_port="through 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="21"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
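For readers without RapidMiner at hand, the same calculation can be sketched in Python. This mirrors the SE expression used in the process (1 minus the confidence for the actual label, squared); the function name and the three sample rows are invented for illustration and are not Titanic data.

```python
import math


def rmse_from_confidences(rows):
    """RMSE for a classifier, using 1 - confidence(actual label) as the error.

    rows: list of (actual_label, {label: confidence}) pairs.
    """
    squared_errors = [(1.0 - conf[actual]) ** 2 for actual, conf in rows]
    mse = sum(squared_errors) / len(squared_errors)  # mean squared error
    return math.sqrt(mse)                            # root of the mean


# Hypothetical predictions: actual label plus per-class confidences.
rows = [
    ("Yes", {"Yes": 0.9, "No": 0.1}),
    ("No",  {"Yes": 0.2, "No": 0.8}),
    ("Yes", {"Yes": 0.4, "No": 0.6}),
]

print("RMSE:", rmse_from_confidences(rows))
```

Because the error is based on a confidence between 0 and 1, the per-example difference is generally a fraction and not just 0 or 1.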
HTH!
YY
Answers
If you have a binary label (0 or 1) to predict with a Decision Tree, the best way is to convert the target type from numeric to nominal and apply the "Performance (Binomial Classification)" operator to extract the measures for classification models.
AUC, classification error, accuracy, recall, F-measure, etc. are the metrics usually used for binomial classification.
In your example, note that RMSE is an error metric commonly used to measure the performance of regression models. I am not sure about the definitions of margin and soft margin in "Performance (Classification)". I will double-check with the internal team and update later.
As a good reference, the log loss is defined below; it is commonly used in classification and takes the confidence values into account.
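As a rough illustration outside RapidMiner, the sketch below treats a numeric 0/1 label as nominal (by converting it to strings) and computes a few of the metrics listed above from the confusion counts. The function name, the choice of "1" as the positive class, and the sample labels are assumptions for the example; AUC is left out because it needs confidences and not only predicted classes.

```python
def binomial_metrics(actual, predicted, positive="1"):
    """Accuracy, precision, recall and F-measure for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "classification_error": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical numeric labels, converted to nominal (string) values first.
actual = [str(v) for v in [1, 0, 1, 1, 0, 0]]
predicted = [str(v) for v in [1, 0, 0, 1, 1, 0]]

metrics = binomial_metrics(actual, predicted)
print(metrics)
```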
-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))
https://www.quora.com/What-is-an-intuitive-explanation-for-the-log-loss-function
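The log loss formula above can be sketched in Python as follows, with yt as the actual label (0 or 1) and yp as the predicted confidence for class 1. Averaging over the examples and clipping yp away from exactly 0 and 1 are assumptions added here to get a single finite number; they are not part of the formula as quoted.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean of -(yt*log(yp) + (1-yt)*log(1-yp)) over all examples."""
    total = 0.0
    for yt, yp in zip(y_true, y_prob):
        yp = min(max(yp, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += -(yt * math.log(yp) + (1 - yt) * math.log(1.0 - yp))
    return total / len(y_true)


# Hypothetical labels and predicted confidences for class 1.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.4]

print("log loss:", log_loss(y_true, y_prob))
```

A confident correct prediction (yp close to yt) contributes almost nothing, while a confident wrong prediction is penalized heavily.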
Cheers,
YY
Then how do I calculate the margin and soft margin?
In a decision tree I see no probabilities associated with an individual instance. The tree simply classifies each instance to some class. So what is the predicted value? What is the margin: some minimum value of confidence from all the nodes in a tree?
However, if I understand it correctly, the Example class represents only 1 instance from the data set. So for every instance there is a separate value of confidence.
I do not know how it is calculated for every instance. In the decision tree I can set the confidence level (probably this is the z value from the normal distribution, used to calculate confidence for pruning). But if every instance has its own confidence, then I do not know how it is calculated.
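One common convention, assumed here and not confirmed anywhere in this thread, is that a decision tree derives an instance's confidences from the class distribution of the training examples in the leaf the instance falls into. A minimal Python sketch of that idea, with a made-up function name and a made-up leaf:

```python
from collections import Counter


def leaf_confidences(leaf_labels):
    """Class fractions among the training examples that ended up in one leaf.

    Assumption: every instance routed to this leaf receives these fractions
    as its per-class confidences.
    """
    counts = Counter(leaf_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical leaf holding 8 "Yes" and 2 "No" training examples.
leaf = ["Yes"] * 8 + ["No"] * 2
conf = leaf_confidences(leaf)

print(conf)
```

Under this assumption, all instances that reach the same leaf share the same confidences, and the pruning confidence parameter is a separate setting that does not enter this calculation.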