No decision tree created with parameter "criterion" set to "gini_index"
Good morning,
I used the "Decision Tree" operator to create a model with a training dataset.
With parameter "criterion" to "gini_index" no decision tree is created on the results : The differents attributes are not taken into account.
When the parameter "criterion " is "accuracy", or "gain-ratio" or "information_gain", the decision trees are good created.
My training dataset and scoreset are in attached files
Here my process in xml :
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Training" width="90" x="112" y="34">
<parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Training"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="User_ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="380" y="34">
<parameter key="attribute_name" value="eReader_Adoption"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="514" y="34">
<parameter key="criterion" value="gini_index"/>
<parameter key="maximal_depth" value="20"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Scoring" width="90" x="112" y="238">
<parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Scoring"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="238">
<parameter key="attribute_name" value="User_ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="136">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<connect from_op="Training" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Scoring" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Is it a bug?
Can you help me?
Thank you
Lionel
Best Answers
earmijo
Let me add a couple of sentences to Thomas_Ott's answer. I was confused myself when I started using RapidMiner.
You can find a nice and clear explanation of both pruning and pre-pruning here:
Machine Learning: Pruning Decision Trees
You should experiment in your process with all the variations.
Pre-pruning (early stopping): You stop splitting if no significant benefit results from an additional split.
Pruning (post-pruning): You keep splitting until you reach the desired number of levels (depth = the main measure of complexity of the tree) but you try to simplify the tree afterwards.
Neither Pre-pruning nor Pruning : Try it. The tree will grow symmetrically until reaching the desired number of levels (depth).
If processing time is not an issue, there is no reason to ever use the pre-pruning option. In the worst case, you'll end up with the same performance metric, but there is a chance (a real one, as your example illustrates) that you'll end up doing better with post-pruning.
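To make the distinction concrete, here is a minimal sketch in scikit-learn (used as a stand-in, since the RapidMiner XML above can't express this; the parameter names below are sklearn's analogues of RapidMiner's pruning options, not RapidMiner's own):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning (early stopping): a split is only made if it reduces
# impurity by at least the threshold, so weak splits are never taken.
pre_pruned = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=0.1).fit(X, y)

# Post-pruning: grow the tree fully first, then collapse weak subtrees
# afterwards via cost-complexity pruning (larger ccp_alpha prunes more).
post_pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02).fit(X, y)

# Compare how large each resulting tree is.
print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())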
Answers
@lionelderkrikor you also have to understand that the criteria all have different ways of splitting the dataset into a tree. It might be that gini_index is not a good criterion for splitting your data.
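One quick way to see this effect, again sketched in scikit-learn as a stand-in (its criterion names differ slightly from RapidMiner's; "entropy" roughly corresponds to information gain): the two criteria may pick different attributes for the very first split.

from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    # tree_.feature[0] is the index of the attribute tested at the root node
    print(criterion, "root attribute:", tree.tree_.feature[0])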
Hi earmijo,
By unchecking "apply prepruning", a decision tree is created correctly in my case.
I'm a beginner in RapidMiner and data science: can you explain what the goal of checking "pre-pruning" is? In which case(s) should I check (or not check) this option?
In my case, when it is checked (with all related parameters at their default values), there is only one node whose conclusion is the majority class of the training set (it is a four-class label problem; see the attached file). So when applied, this model predicts this single class for the entire scoring dataset.
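A small numeric sketch of why this can happen (the class counts below are hypothetical, not taken from the actual dataset, and RapidMiner's exact gain formula may differ in detail): with four classes, the gini gain of even a reasonably informative split can fall below the default minimal gain of 0.1, so pre-pruning rejects every candidate split and the root stays a single majority-class leaf.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [50, 25, 15, 10]                      # hypothetical four-class node (100 examples)
left, right = [30, 20, 0, 0], [20, 5, 15, 10]  # a plausible candidate split

n = sum(parent)
gain = gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)
print(round(gain, 3))  # 0.065 -> below the 0.1 threshold, so the split is rejected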
Thank you,
Lionel
Hi Thomas,
Thank you for your explanation. I now understand the role of these options better.
Regards,
Lionel
Hi @earmijo
Thank you for your feedback and the resources about decision trees.
If I understand correctly, I must be very careful when using decision trees:
I have to try all combinations [criterion / apply or don't apply pruning / apply or don't apply pre-pruning] and
evaluate the accuracy of the resulting models, using a split validation to select the best model.
Regards,
Lionel
Hi,
I would be careful with a simple split validation and would rather use a cross-validation (X-Validation) with a proper hold-out set.
Best,
Martin
Dortmund, Germany
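A minimal sketch of that workflow, once more in scikit-learn as a stand-in for RapidMiner's X-Validation operator (the parameter grid below is illustrative, not a recommendation): hold out a test set first, pick the best combination of criterion and pruning settings by 10-fold cross-validation on the rest, and score the hold-out set exactly once at the end.

from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
# Proper hold-out set, never touched during model selection.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0.0, 0.01, 0.1],  # pre-pruning strength
    "ccp_alpha": [0.0, 0.01],                   # post-pruning strength
}
# 10-fold cross-validation over all parameter combinations.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_hold, y_hold))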
Hi @mschmitz,
Thank you for your advice: I'll use a cross-validation on the models.
Regards,
Lionel