"Append-Operator in Testing Phase of X-Validation changes confusion matrix"

Muhammad · November 2014

Hi,

I am working on a classification problem with 3 classes: good (180), mediocre (4535) and bad (183). The numbers in parentheses are the example counts per class.

In my RapidMiner process I only learn a model for "good" and "bad". In the testing phase I want to modify the prediction depending on the confidence of my classifier, so I filter out all examples with low confidence and assign them to the default class "mediocre".
To do this reassignment I use a "Filter Examples" operator together with a "Replace" operator.
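
For readers who prefer code to operator names, the reassignment rule described above can be sketched in Python roughly as follows. This is only an illustration: the function name `reassign` and the sample rows are hypothetical, while the two thresholds are taken from the filter condition in the process posted in this thread.

```python
def reassign(prediction, conf_bad, conf_good, bad_min=0.999, good_min=0.99):
    """Fall back to 'mediocre' unless the model is confident in 'bad' or 'good'."""
    # Same condition as the filter: confidence(bad) < 0.999 && confidence(good) < 0.99
    if conf_bad < bad_min and conf_good < good_min:
        return "mediocre"
    return prediction


# Hypothetical rows: (predicted class, confidence(bad), confidence(good))
rows = [
    ("bad", 0.9995, 0.0005),
    ("good", 0.02, 0.98),
    ("good", 0.005, 0.995),
]

adjusted = [reassign(p, cb, cg) for p, cb, cg in rows]
print(adjusted)
```

Only the prediction is touched here; the true label and the number of examples stay the same.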

My problem is:
If I run my process without the reassignment step (i.e. filtering and replacing), I get the expected values for true good (180), true mediocre (4535) and true bad (183) in my confusion matrix. However, if I do the reassignment, my confusion matrix yields unexpected values for true good, mediocre and bad.
Why is that happening?
My process is as follows:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve DataSet-WhiteWine" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Local Repository/data/GroupProject_WineQuality_White"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.015" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="attribute_name" value="quality"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID" width="90" x="313" y="30">
        <parameter key="create_nominal_ids" value="false"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" class="normalize" compatibility="5.3.015" expanded="true" height="94" name="Normalize" width="90" x="447" y="30">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="method" value="range transformation"/>
        <parameter key="min" value="0.0"/>
        <parameter key="max" value="1.0"/>
      </operator>
      <operator activated="true" class="discretize_by_user_specification" compatibility="5.3.015" expanded="true" height="94" name="Discretize" width="90" x="581" y="30">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="quality"/>
        <parameter key="attributes" value="quality"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="attribute_type" value="nominal"/>
        <list key="classes">
          <parameter key="bad" value="4.0"/>
          <parameter key="mediocre" value="7.0"/>
          <parameter key="good" value="10.0"/>
        </list>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.3.015" expanded="true" height="112" name="Validation" width="90" x="715" y="30">
        <parameter key="create_complete_model" value="false"/>
        <parameter key="average_performances_only" value="true"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_validations" value="30"/>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1985"/>
        <process expanded="true">
          <operator activated="true" class="filter_examples" compatibility="5.3.015" expanded="true" height="76" name="Filter Examples" width="90" x="45" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="quality != mediocre"/>
            <parameter key="invert_filter" value="false"/>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="5.3.015" expanded="true" height="76" name="Naive Bayes" width="90" x="179" y="30">
            <parameter key="laplace_correction" value="true"/>
          </operator>
          <connect from_port="training" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="76" name="Multiply" width="90" x="179" y="30"/>
          <operator activated="true" class="filter_examples" compatibility="5.3.015" expanded="true" height="76" name="Filter Examples (3)" width="90" x="112" y="210">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="confidence(bad)&lt;0.999 &amp;&amp; confidence(good)&lt;0.99"/>
            <parameter key="invert_filter" value="false"/>
          </operator>
          <operator activated="true" breakpoints="after" class="filter_examples" compatibility="5.3.015" expanded="true" height="76" name="Filter Examples (2)" width="90" x="246" y="255">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="confidence(bad)&lt;0.999 &amp;&amp; confidence(good)&lt;0.99"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <operator activated="true" breakpoints="after" class="replace" compatibility="5.3.015" expanded="true" height="76" name="Replace (3)" width="90" x="246" y="165">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="prediction(quality)"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="replace_what" value="bad|good"/>
            <parameter key="replace_by" value="mediocre"/>
          </operator>
          <operator activated="true" class="union" compatibility="5.3.015" expanded="true" height="76" name="Union" width="90" x="447" y="210"/>
          <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="|quality|prediction(quality)"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance" width="90" x="514" y="30">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="false"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Filter Examples (3)" to_port="example set input"/>
          <connect from_op="Filter Examples (3)" from_port="example set output" to_op="Replace (3)" to_port="example set input"/>
          <connect from_op="Filter Examples (3)" from_port="original" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Union" to_port="example set 2"/>
          <connect from_op="Replace (3)" from_port="example set output" to_op="Union" to_port="example set 1"/>
          <connect from_op="Union" from_port="union" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve DataSet-WhiteWine" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Discretize" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Through a bit of debugging I found out that if you just add an "Append" operator with only one input (the output of "Apply Model", nothing else) in the testing phase of X-Validation, the confusion matrix yields wrong values for true <classname>.
In the above process I first used "Append" and then changed it to the "Union" operator, but I still have the same problem.

Am I doing anything wrong?

Thanks in advance for your help!!!

MartinLiebig · November 2014

Hello Muhammad,

I've created an example process with the Iris data set where I learn on two classes and assign the "unsure" predictions (confidence between 0.2 and 0.8 in the process below) to the third:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="6.1.000" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.000" expanded="true" height="112" name="Validation" width="90" x="246" y="30">
        <description>A cross-validation evaluating a decision tree model.</description>
        <process expanded="true">
          <operator activated="true" class="filter_examples" compatibility="6.1.000" expanded="true" height="94" name="Filter Examples" width="90" x="45" y="30">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="label.does_not_equal.Iris-versicolor"/>
            </list>
          </operator>
          <operator activated="true" class="random_forest" compatibility="6.1.000" expanded="true" height="76" name="Random Forest" width="90" x="179" y="30">
            <parameter key="number_of_trees" value="25"/>
          </operator>
          <connect from_port="training" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Random Forest" to_port="training set"/>
          <connect from_op="Random Forest" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.0.000" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="rename_by_replacing" compatibility="6.1.000" expanded="true" height="76" name="Rename by Replacing" width="90" x="179" y="165">
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="replace_what" value="\(|\)|-"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="6.1.000" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="165">
            <list key="function_descriptions">
              <parameter key="predictionlabel" value="if((confidenceIrissetosa &gt; 0.2 &amp;&amp; confidenceIrissetosa &lt;0.8),&quot;Iris-versicolor&quot;,predictionlabel)"/>
            </list>
          </operator>
          <operator activated="true" class="rename_by_replacing" compatibility="6.1.000" expanded="true" height="76" name="Rename by Replacing (2)" width="90" x="447" y="165">
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="replace_what" value="predictionlabel"/>
            <parameter key="replace_by" value="prediction(label)"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.000" expanded="true" height="76" name="Performance" width="90" x="581" y="30"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Rename by Replacing" to_port="example set input"/>
          <connect from_op="Rename by Replacing" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Rename by Replacing (2)" to_port="example set input"/>
          <connect from_op="Rename by Replacing (2)" from_port="example set output" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

This works quite well for me. I hope you can use it as a template.

Best,

Martin

Muhammad · November 2014

Hi Martin,

Thanks for your reply. Could you please elaborate on your process, i.e. why is it necessary to rename the attributes which were generated by RapidMiner itself?

Also, I tried to adapt your approach to my problem. However, I get the same issue.

I found out that it is somehow related to the "Append" operator.

I created an example using the Weighting data. Please have a look at this process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve Weighting" width="90" x="45" y="120">
        <parameter key="repository_entry" value="//Samples/data/Weighting"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.3.015" expanded="true" height="112" name="Validation" width="90" x="179" y="120">
        <process expanded="true">
          <operator activated="true" class="naive_bayes" compatibility="5.3.015" expanded="true" height="76" name="Naive Bayes" width="90" x="45" y="30"/>
          <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="append" compatibility="5.3.015" expanded="true" height="76" name="Append" width="90" x="179" y="75"/>
          <operator activated="true" class="performance_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance" width="90" x="313" y="120">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Weighting" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

You will see an "Append" operator in the testing phase which has only one input, so it shouldn't do anything. However, if you compare the confusion matrices of the process with and without the "Append" operator, you will notice a difference.
The correct confusion matrix (in terms of the total number of true positives and true negatives) is the one from the process without the "Append" operator. The other one yields a wrong total number of true positives and true negatives.
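
One way to sanity-check the two matrices: the per-true-class totals of a confusion matrix depend only on the labels of the examples that are counted, not on their predictions. The Python sketch below illustrates this with made-up data. The helper names and the label values are hypothetical, and this is a generic illustration, not a description of how RapidMiner computes or averages performance internally.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(labels, predictions))


def true_class_totals(counts):
    """Sum the confusion counts per true label."""
    totals = Counter()
    for (true_label, _predicted), n in counts.items():
        totals[true_label] += n
    return totals


# Hypothetical labels and two different sets of predictions for the same examples
labels = ["positive", "negative", "positive", "negative", "positive"]
before = ["positive", "negative", "negative", "negative", "positive"]
after = ["positive", "positive", "negative", "negative", "negative"]

totals_before = true_class_totals(confusion_counts(labels, before))
totals_after = true_class_totals(confusion_counts(labels, after))

# Changing predictions moves counts between cells, but not between true classes.
print(totals_before == totals_after == Counter(labels))
```

If the totals per true class differ between two runs, that suggests a different set of examples or labels was counted, rather than only different predictions.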

Any idea why? Also, what do I need to do to use the "Append" operator on a data set with about 5000 data points in total?

Thanks,
Muhammad

MartinLiebig · November 2014

Hi,

The Append operator modifies the meta data, so there are some changes, but I am currently not sure how this affects the Performance operator.

Regarding my process:
Generate Attributes cannot handle attribute names that contain brackets, minus signs, plus signs or whitespace, because these characters are interpreted as part of the formula. That is why I needed to replace them.
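
As a rough illustration of what that renaming step does to the attribute names, here is a small Python sketch. It applies the same pattern as the `replace_what` parameter of the first Rename by Replacing operator (`\(|\)|-`) and assumes the matches are replaced by nothing. The helper name `sanitize` and the list of names are only for illustration.

```python
import re

# Same pattern as replace_what in "Rename by Replacing": ( or ) or -
SPECIAL = re.compile(r"\(|\)|-")


def sanitize(name):
    """Strip brackets and minus signs so the name is safe inside a formula."""
    return SPECIAL.sub("", name)


names = ["prediction(label)", "confidence(Iris-setosa)", "confidence(Iris-virginica)"]
print([sanitize(n) for n in names])
```

The sanitized names are the ones the Generate Attributes expression refers to (`predictionlabel`, `confidenceIrissetosa`). The second Rename by Replacing operator then restores `prediction(label)` so that the Performance operator sees the usual prediction attribute name.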
