No decision tree created with parameter "criterion" set to "gini_index"
Good morning,
I used the "Decision Tree" operator to create a model with a training dataset.
With parameter "criterion" to "gini_index" no decision tree is created on the results : The differents attributes are not taken into account.
When the parameter "criterion " is "accuracy", or "gain-ratio" or "information_gain", the decision trees are good created.
My training dataset and scoreset are in attached files
Here my process in xml :
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Training" width="90" x="112" y="34">
<parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Training"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="User_ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="380" y="34">
<parameter key="attribute_name" value="eReader_Adoption"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="514" y="34">
<parameter key="criterion" value="gini_index"/>
<parameter key="maximal_depth" value="20"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Scoring" width="90" x="112" y="238">
<parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Scoring"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="238">
<parameter key="attribute_name" value="User_ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="136">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<connect from_op="Training" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Scoring" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Is it a bug?
Can you help me?
Thank you
Lionel
Best Answers
earmijo
Let me add a couple of sentences to Thomas_Ott's answer. I was confused myself when I started using RapidMiner.
You can find a nice and clear explanation of both pruning and pre-pruning here:
Machine Learning: Pruning Decision Trees
You should experiment in your process with all the variations.
Pre-pruning (early stopping): You stop splitting if no significant benefit results from an additional split.
Pruning (post-pruning): You keep splitting until you reach the desired number of levels (depth = the main measure of complexity of the tree) but you try to simplify the tree afterwards.
Neither Pre-pruning nor Pruning : Try it. The tree will grow symmetrically until reaching the desired number of levels (depth).
If processing time is not an issue, there is no reason to ever use the pre-pruning option. In the worst case, you'll end up with the same performance metric, but there is a chance (a real one, as your example illustrates) that you'll end up doing better with post-pruning.
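To make the distinction concrete, here is a minimal sketch in scikit-learn (used as a stand-in, since the RapidMiner XML above can't express this; the parameter names below are sklearn's analogues of RapidMiner's pruning options, not RapidMiner's own):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning (early stopping): a split is only made if it reduces
# impurity by at least the threshold, so weak splits are never taken.
pre_pruned = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=0.1).fit(X, y)

# Post-pruning: grow the tree fully first, then collapse weak subtrees
# afterwards via cost-complexity pruning (larger ccp_alpha prunes more).
post_pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02).fit(X, y)

# Compare how large each resulting tree is.
print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())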
Answers
@lionelderkrikor you also have to understand that the criteria all have different ways of splitting the dataset into a tree. It might be that gini_index is not a good criterion for splitting your data.
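One quick way to see this effect, again sketched in scikit-learn as a stand-in (its criterion names differ slightly from RapidMiner's; "entropy" roughly corresponds to information gain): the two criteria may pick different attributes for the very first split.

from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    # tree_.feature[0] is the index of the attribute tested at the root node
    print(criterion, "root attribute:", tree.tree_.feature[0])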
Hi earmijo,
By unchecking "apply prepruning", a decision tree is created correctly in my case.
I'm a beginner in RapidMiner and data science: can you explain what the goal of checking "pre-pruning" is? In which case(s) should I check (or not check) this option?
In my case, when it is checked (with all related parameters at their default values), there is only one node whose conclusion is the majority class of the training set (it is a four-class label problem; see the attached file). So when applied, this model predicts this single class for the entire scoring dataset.
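A small numeric sketch of why this can happen (the class counts below are hypothetical, not taken from the actual dataset, and RapidMiner's exact gain formula may differ in detail): with four classes, the gini gain of even a reasonably informative split can fall below the default minimal gain of 0.1, so pre-pruning rejects every candidate split and the root stays a single majority-class leaf.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [50, 25, 15, 10]                      # hypothetical four-class node (100 examples)
left, right = [30, 20, 0, 0], [20, 5, 15, 10]  # a plausible candidate split

n = sum(parent)
gain = gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)
print(round(gain, 3))  # 0.065 -> below the 0.1 threshold, so the split is rejected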
Thank you,
Lionel
Hi Thomas,
Thank you for your explanation. I now understand the role of these options better.
Regards,
Lionel
Hi @earmijo
Thank you for your feedback and the resources about decision trees.
If I understand correctly, I must be very careful when using decision trees:
I have to try all combinations [criterion / apply or don't apply pruning / apply or don't apply pre-pruning] and
evaluate the accuracy of the resulting models, using a split validation to select the best model.
Regards,
Lionel
Hi,
I would be careful with a simple split validation and would rather use a cross-validation (X-Validation) with a proper hold-out set.
Best,
Martin
Dortmund, Germany
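A minimal sketch of that workflow, once more in scikit-learn as a stand-in for RapidMiner's X-Validation operator (the parameter grid below is illustrative, not a recommendation): hold out a test set first, pick the best combination of criterion and pruning settings by 10-fold cross-validation on the rest, and score the hold-out set exactly once at the end.

from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
# Proper hold-out set, never touched during model selection.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0.0, 0.01, 0.1],  # pre-pruning strength
    "ccp_alpha": [0.0, 0.01],                   # post-pruning strength
}
# 10-fold cross-validation over all parameter combinations.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_hold, y_hold))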
Hi @mschmitz,
Thank you for your advice: I'll use a cross-validation on the models.
Regards,
Lionel