Execute Python fails inside an Optimization / Cross Validation operator
Hi,
I use the "Execute Python" operator to generate dummy variables for a dataset.
I know that this can be done with the "Nominal to Numerical" operator, or skipped entirely,
but I discovered that, without cross validation or optimization, the resulting decision tree (and its associated predictions and accuracy) differs depending on whether the dummy variables are generated by "Nominal to Numerical" or by "Execute Python", which seems odd.
In my case, the two "Execute Python" operators, which sit in the training and testing parts of a "Cross Validation" operator that is itself inside
an "Optimization" operator, do not seem to be executed, and the process then fails.
Here is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="391">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (3)" width="90" x="179" y="391">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = pd.get_dummies(data, columns=['Outlook', 'Wind'])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="112" y="544">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="313" y="85">
<list key="parameters">
<parameter key="Decision Tree.criterion" value="gain_ratio,information_gain,gini_index,accuracy"/>
<parameter key="Decision Tree.apply_pruning" value="true,false"/>
<parameter key="Decision Tree.apply_prepruning" value="true,false"/>
<parameter key="Decision Tree.maximal_depth" value="[-1.0;20;20;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="45" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="136">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = pd.get_dummies(data, columns=['Outlook', 'Wind'])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="136">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="313" y="136">
<parameter key="maximal_depth" value="-1"/>
</operator>
<connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (3)" width="90" x="45" y="238">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="112" y="85">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = pd.get_dummies(data, columns=['Outlook', 'Wind'])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="246" y="187">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="380" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Execute Python (2)" to_port="input 1"/>
<connect from_op="Execute Python (2)" from_port="output 1" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="391">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="391">
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Execute Python (3)" to_port="input 1"/>
<connect from_op="Execute Python (3)" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Retrieve Golf" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 4"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 5"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model" to_port="model"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 6"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<connect from_op="Performance" from_port="example set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
</process>
</operator>
</process>
My approach may seem pointless, but there may be a bug in the "Execute Python" operator, and reporting it could help those who use
this operator for more useful tasks.
Thank you for your help,
Regards,
Lionel
Best Answer
JEdward
The problem with your Golf dataset is that the attributes don't match. Using breakpoints, I can see that your test fold contains only one record (and your training fold is also small).
Because you convert to dummy variables separately on the training and testing sides, it is quite likely that some attributes won't match your model: a dummy column is only created for values that actually occur in a fold, so the test data can lack columns the model expects.
This is bad practice, and I recommend that you pass your preprocessing model from the training side to the testing side of the RapidMiner process instead.
However, since you said you want to do it this way, you need to ensure that the attributes of your test dataset match those the model was trained on. You can do this with operators such as Superset. See the XML below.
Maybe you could also post an example of the incorrect results you're getting with the Nominal to Numerical operator?
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="391">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (3)" width="90" x="179" y="391">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = pd.get_dummies(data, columns=['Outlook', 'Wind'])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="112" y="544">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="166" name="Optimize Parameters (Grid)" width="90" x="313" y="85">
<list key="parameters">
<parameter key="Decision Tree.criterion" value="gain_ratio,information_gain,gini_index,accuracy"/>
<parameter key="Decision Tree.apply_pruning" value="true,false"/>
<parameter key="Decision Tree.apply_prepruning" value="true,false"/>
<parameter key="Decision Tree.maximal_depth" value="[-1.0;20;20;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="112" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
<parameter key="use_underscore_in_name" value="true"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="289">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = pd.get_dummies(data, columns=['Outlook', 'Wind'])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="112" y="187">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="246" y="85">
<parameter key="maximal_depth" value="-1"/>
</operator>
<operator activated="true" class="remember" compatibility="7.6.001" expanded="true" height="68" name="Remember" width="90" x="313" y="187">
<parameter key="name" value="myDataSet"/>
</operator>
<connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<connect from_op="Decision Tree" from_port="exampleSet" to_op="Remember" to_port="store"/>
<connect from_op="Remember" from_port="stored" to_port="through 1"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="126"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="45" y="85">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    data = pd.get_dummies(data, columns=['Outlook', 'Wind'])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="179" y="187">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="superset" compatibility="7.6.001" expanded="true" height="82" name="Superset" width="90" x="313" y="238"/>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="380" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Execute Python (2)" to_port="input 1"/>
<connect from_port="through 1" to_op="Superset" to_port="example set 2"/>
<connect from_op="Execute Python (2)" from_port="output 1" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Superset" to_port="example set 1"/>
<connect from_op="Superset" from_port="superset 1" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="105"/>
<portSpacing port="source_through 2" spacing="21"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="87" resized="true" width="237" x="254" y="337">There should really be a replace missing values here too, but I didn't feel like adding it. :P</description>
</process>
</operator>
<connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="391">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="recall" compatibility="7.6.001" expanded="true" height="68" name="Recall" width="90" x="313" y="493">
<parameter key="name" value="myDataSet"/>
<description align="center" color="transparent" colored="false" width="126">This needs to happen AFTER the Optimize has run.</description>
</operator>
<operator activated="true" class="superset" compatibility="7.6.001" expanded="true" height="82" name="Superset (2)" width="90" x="447" y="442"/>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="391">
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Execute Python (3)" to_port="input 1"/>
<connect from_op="Execute Python (3)" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Retrieve Golf" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 4"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 5"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model" to_port="model"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 6"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Superset (2)" to_port="example set 1"/>
<connect from_op="Recall" from_port="result" to_op="Superset (2)" to_port="example set 2"/>
<connect from_op="Superset (2)" from_port="superset 1" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<connect from_op="Performance" from_port="example set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
</process>
</operator>
</process>
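The attribute mismatch described in this answer can be reproduced outside RapidMiner with plain pandas. The snippet below is only a sketch: the two small data frames are made-up stand-ins for a training fold and a one-record test fold (not the actual Golf samples), and `encode` and `align_to_training` are hypothetical helper names. It shows that `pd.get_dummies` only creates columns for values present in the data it sees, and one possible way to align the test columns to the training columns with `DataFrame.reindex`.

```python
import pandas as pd


def encode(df, columns):
    """One-hot encode the given nominal columns, like the Execute Python script."""
    return pd.get_dummies(df, columns=columns)


def align_to_training(test_encoded, training_columns):
    """Match the training layout: add missing dummy columns as 0, drop extras."""
    return test_encoded.reindex(columns=training_columns, fill_value=0)


# Hypothetical training fold that happens to contain every Outlook value.
train = pd.DataFrame({
    "Outlook": ["sunny", "overcast", "rain", "sunny"],
    "Wind": ["true", "false", "false", "true"],
    "Temperature": [85, 83, 70, 72],
    "Play": ["no", "yes", "yes", "no"],
})

# Hypothetical test fold with a single record, as observed with the breakpoint.
test = pd.DataFrame({
    "Outlook": ["rain"],
    "Wind": ["true"],
    "Temperature": [65],
    "Play": ["no"],
})

train_enc = encode(train, ["Outlook", "Wind"])
test_enc = encode(test, ["Outlook", "Wind"])

# Dummy columns the model was trained on but the test fold never produced.
missing = sorted(set(train_enc.columns) - set(test_enc.columns))

# After alignment, both frames share the same columns in the same order.
test_aligned = align_to_training(test_enc, list(train_enc.columns))
```

Inside RapidMiner, the Superset operator used in the process above plays a comparable role; the pandas version is just easier to inspect in isolation.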
Answers
Hi @JEdward
Thank you for your response and your advice.
1. Indeed, when the dummy variables are generated inside the "Optimization" operator or in the main process (right after the training dataset), the process runs fine. So the problem was not with the "Execute Python" operator.
I did it this way because, in a previous topic, I was told that "the conversion into dummy variables need to be done inside of x-val to do it right".
2. Concerning the differences between the two methods of generating dummy variables:
2.a Here is the process using "Execute Python":
2.b The associated results (Python):
2.c Here is the (same) process using "Nominal to Numerical":
2.d The associated results (Nominal to Numerical):
How can we explain this behaviour?
Thank you.
Regards,
Lionel
TLDR: Change your datatype from Integer to Real for Temperature & Humidity.
Well this was interesting!
This happens because your Execute Python step parses the numbers and changes Temperature & Humidity from the Integer to the Real data type. For some reason the Real data type performs significantly better than the Integer data type for this dataset, and I have absolutely no idea why. The two models differ in that the Real Decision Tree has a final split using Temperature 71, whereas the Integer Decision Tree uses Outlook = Rain as the final split. So it's probably related to the way splits are calculated. Does anyone want to look at the DT code and see if they can spot why it behaves like this?
And lastly, here's your original process, changed so that it uses the Real conversion and also uses Nominal to Numerical the RapidMiner way (so the preprocessing model created in training is passed through to the test part of the subprocess).
However, I would advise being careful about using accuracy as the performance measure here: the Decision Tree produced doesn't really classify items very well. Despite the high accuracy, it's actually just classifying every day as a golf day. (Whilst this might be true in US politics, it's not necessarily true in our dataset.)
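To make the accuracy caveat concrete, here is a small sketch in plain Python. The label lists are invented for illustration (they are not the results of the process above), and `accuracy` and `class_recall` are hypothetical helper names. A model that predicts "yes" for every day can still reach a respectable accuracy while never recognising a single "no" day.

```python
def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def class_recall(actual, predicted, label):
    """Fraction of the examples of one class that were predicted as that class."""
    relevant = [p for a, p in zip(actual, predicted) if a == label]
    return sum(p == label for p in relevant) / len(relevant)


# Hypothetical labels: 9 "yes" days and 5 "no" days.
actual = ["yes"] * 9 + ["no"] * 5

# A degenerate model that calls every day a golf day.
predicted = ["yes"] * len(actual)

acc = accuracy(actual, predicted)                    # 9 of 14 correct
recall_yes = class_recall(actual, predicted, "yes")  # every "yes" day found
recall_no = class_recall(actual, predicted, "no")    # no "no" day found
```

Looking at per-class recall (or the full confusion matrix) next to accuracy is one way to spot this kind of model.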
Hi @JEdward
First, thank you for spending time on your analysis and for updating my process.
I fully understand that, in this particular case, accuracy is not a relevant performance measure. (I chose to post the performance window "by default" to illustrate the difference between the results of the two methods.)
Until this mysterious behaviour is clarified, what do you recommend in practice when using decision trees (and maybe other algorithms)?
- Systematically use the "Numerical to Real" operator on the datasets in order to work with real values?
- Systematically execute a process twice (once with integer values, once with real values) and select the best model (because what is true in the specific Golf case may be false in another case)?
- Do nothing, because in a real case study a parameter optimization is performed, and the differences between the "real" and "integer" results will be (almost totally) masked by the optimization process?
- Maybe another approach?
And to conclude, "every day as golf day": maybe this outdoor sport is good for body and mind and helps one make the best decisions... (something to meditate on).
Thank you for your responses,
Best regards,
Lionel
Hi,
The decision tree does not care whether values are real or integer; to the tree it is all a double array. What influences it is the order of the attributes, for a very simple reason: when it searches for the best split and two attributes offer the same benefit, it takes the first attribute with this benefit.
When you look at the "Integer Decision Tree" and "Real Decision Tree" results in JEdward's process, you see that the differing split at the lowest node leads to the same "purity" (3 pure yes, 2 no with one wrong). When you put a breakpoint before the "Real Decision Tree" and "Integer Decision Tree" operators in the process, you see that for "Real Decision Tree" the attribute "Temperature" comes first, while for "Integer Decision Tree" the attribute "Outlook_rain" comes first.
When you apply a "Type1 to Type2" operator, the order of the attributes might change, in particular if you only change some of the attribute types.
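The tie-breaking effect described here can be illustrated with a toy split search in plain Python. This is not RapidMiner's decision tree code, only a sketch under simple assumptions: nominal attributes, information gain as the criterion, and a strict "greater than" comparison so that the first attribute wins a tie. The attribute names `Temperature_high` and `Outlook_rain` and the six rows are invented; they are built so that both attributes produce the same partition (3 pure yes, 2 no with one wrong).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label):
    """Entropy reduction from splitting on each distinct value of an attribute."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_split(rows, attributes, label):
    """Return the first attribute with the highest gain (earliest wins on ties)."""
    best_attr, best_gain = None, float("-inf")
    for attr in attributes:
        gain = information_gain(rows, attr, label)
        if gain > best_gain:  # strict comparison keeps the earlier attribute
            best_attr, best_gain = attr, gain
    return best_attr


# Invented rows: both attributes separate the examples in exactly the same way.
rows = [
    {"Temperature_high": "true", "Outlook_rain": "false", "Play": "yes"},
    {"Temperature_high": "true", "Outlook_rain": "false", "Play": "yes"},
    {"Temperature_high": "true", "Outlook_rain": "false", "Play": "yes"},
    {"Temperature_high": "false", "Outlook_rain": "true", "Play": "no"},
    {"Temperature_high": "false", "Outlook_rain": "true", "Play": "no"},
    {"Temperature_high": "false", "Outlook_rain": "true", "Play": "yes"},
]

gain_temperature = information_gain(rows, "Temperature_high", "Play")
gain_outlook = information_gain(rows, "Outlook_rain", "Play")

# Same data, same gains; only the attribute order differs.
first_order = best_split(rows, ["Temperature_high", "Outlook_rain"], "Play")
second_order = best_split(rows, ["Outlook_rain", "Temperature_high"], "Play")
```

With identical gains, `first_order` is `Temperature_high` and `second_order` is `Outlook_rain`, so any preprocessing step that reorders attributes can change the tree without changing the data.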
Hi @gmeier,
Thank you for your explanation of the DT's behaviour in this case.
The causes of these mysterious results are now clear to me.
Regards,
Lionel
Thanks! That clears it up. @gmeier
It also means that I shall experiment with my own future trees by adding a loop with Reorder Attributes to put the attributes in random order, in optimized order, or (more likely) in order of importance.