Solved: How to implement sequential floating forward selection
Hi,
I have a binary classification problem with many features. I don't want to use plain forward selection but a floating method that combines forward selection and backward elimination, repeated multiple times in a row. Is that possible? When I construct it myself, after the first backward elimination I cannot fix the already chosen features and continue forward selection with the remaining ones.
Thanks in advance
Kind regards,
Daniel
Answers
You can use Select by Weights on the exa and wei outputs of the Backward Elimination operator to remove the unselected features.
Best regards,
Marius
thanks for the reply. You are right, I can use the Select by Weights operator after the BE. However, the problem is that I want the second Forward Selection to start from the last optimal subset of the previous BE: doing this, I can overcome the disadvantage of the greedy algorithms. Here is an example process:
Consider a total set of 100 attributes.
1. Start with a Forward Selection. Let's assume it ends after 30 attributes.
2. Since bad features may have been chosen that the greedy approach cannot remove, I run BE on the 30 features. Let's say it ends up with 25 of the 30 features.
3. Now I want to run Forward Selection again, STARTING already with the 25 features, with the remaining 75 features available for selection. Let's assume it selects 7 further features.
4. I run BE again on the 32 features from that subset.
5. And so on, until a possible global optimum is found.
Point 3 is the problem. How can I let the FS start with the 25 features already selected and leave the remaining 75 open for selection?
This is also important for the case where I start with a random set of features and then do SFFS or SFBE.
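The floating scheme in the numbered steps above can be sketched in plain Python. Everything here is an illustration, not RapidMiner code: the `score` function is a hypothetical stand-in for the cross-validated model performance that the FS/BE operators would compute.

```python
def sffs(features, score, max_rounds=10):
    """Sketch of sequential floating forward selection:
    alternate greedy forward steps and greedy backward steps,
    each round starting from the subset the previous round kept."""
    selected = set()
    best = score(frozenset(selected))
    for _ in range(max_rounds):
        changed = False
        # Forward step: add the feature with the best improvement, repeatedly.
        while True:
            gains = {f: score(frozenset(selected | {f}))
                     for f in features if f not in selected}
            if not gains:
                break
            f, s = max(gains.items(), key=lambda kv: kv[1])
            if s <= best:
                break
            selected.add(f)
            best = s
            changed = True
        # Backward step: drop any feature whose removal improves the score.
        while len(selected) > 1:
            losses = {f: score(frozenset(selected - {f})) for f in selected}
            f, s = max(losses.items(), key=lambda kv: kv[1])
            if s <= best:
                break
            selected.remove(f)
            best = s
            changed = True
        if not changed:  # converged: neither adding nor dropping helps
            break
    return selected, best

# Toy example: features "a" and "b" help, the others hurt (hypothetical score).
features = ["a", "b", "c", "d"]
score = lambda subset: len(subset & {"a", "b"}) - 0.5 * len(subset - {"a", "b"})
selected, best = sffs(features, score)
print(selected, best)
```

The point of the floating part is exactly step 3 above: each forward pass starts from the subset the last backward pass kept, instead of from scratch.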
Thanks in advance
Daniel
now I got it. Step 3 is not possible out of the box, but you can work around it. Please see the attached process for reference. The interesting part happens after the Backward Elimination: the dataset after the BE is "remembered", then the FS is started as usual. Inside the FS, however, we "recall" the stored data and join it to the features currently tested by the FS. By checking "remove_duplicate_features" in the Join operator, we avoid duplicate attributes.
The FS, however, only delivers the attributes selected by the operator itself, not the ones we added artificially in its subprocess. Thus, we have to join the data again after the FS.
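As a rough analogue of what the Join with "remove_duplicate_features" checked does, here each example set is modeled as a dict mapping attribute names to columns. This is only a sketch of the idea, not actual RapidMiner code.

```python
def join_features(base, extra):
    """Join two example sets (modeled as {attribute: column} dicts),
    skipping attributes already present in the base set -- analogous to
    RapidMiner's Join with remove_duplicate_features checked."""
    merged = dict(base)
    for name, column in extra.items():
        if name not in merged:  # duplicate attribute: keep the base copy
            merged[name] = column
    return merged

# The remembered BE subset joined with the features the FS currently tests
# (hypothetical attribute names; "f2" appears in both sets only once after).
be_subset = {"f1": [1, 2], "f2": [3, 4]}
fs_candidates = {"f2": [3, 4], "f9": [5, 6]}
joined = join_features(be_subset, fs_candidates)
print(sorted(joined))
```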
Hope this helps, and if you have any questions just post them here.
Best regards,
Marius
thank you for this interesting approach.
I have a few comments.
- I have changed the first selection algorithm to a GA, since it is easier to demonstrate with this operator.
- The connection from the multiplier to the Forward Selection is wrong, because it feeds the set selected after the BE into the Forward Selection again. That is useless, since you have to provide additional new features. Instead, you have to deliver the original example set from before the first selection. The Forward Selection then sees both the BE-selected features and the original features (which contain the BE-selected ones), and the inner Join operator of the FS together with the final Join operator after the FS ensures that each attribute occurs only once.
- Why do you use the Generate ID operator? That does not make sense to me, since it only assigns a number to every instance. Where could instances get lost in this process, so that this control would be needed?
You need the ID to be able to apply the Join operators.
Best regards,
Marius