Forward Selection not as expected
Dear all,
I am trying to select a subset of features from my database using the Forward Selection operator. However, it does not behave as I expect. I am attaching an example process that uses the Iris data set so that everybody can reproduce what I am doing.
The process is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="parallelize_main_process" value="false"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="6.0.008" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="300">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="optimize_selection_forward" compatibility="6.0.008" expanded="true" height="94" name="Forward Selection" width="90" x="313" y="300">
<parameter key="maximal_number_of_attributes" value="4"/>
<parameter key="speculative_rounds" value="10"/>
<parameter key="stopping_behavior" value="without increase"/>
<parameter key="use_relative_increase" value="true"/>
<parameter key="alpha" value="0.05"/>
<parameter key="parallelize_learning_process" value="false"/>
<process expanded="true">
<operator activated="true" class="log" compatibility="6.0.008" expanded="true" height="60" name="Log" width="90" x="380" y="390">
<parameter key="filename" value="U:\temp\LoggingOfDemoProcessForwardselectionMultinomial.log"/>
<list key="log">
<parameter key="Performance" value="operator.Performance.value.performance"/>
<parameter key="NumberOfAttributes" value="operator.Forward Selection.value.number of attributes"/>
<parameter key="FeatureNames" value="operator.Forward Selection.value.feature_names"/>
<parameter key="Accuracy" value="operator.Performance.value.accuracy"/>
<parameter key="ValidationPerformance" value="operator.Validation.value.performance"/>
<parameter key="ValidationPerformance1" value="operator.Validation.value.performance1"/>
<parameter key="ValidationPerformance2" value="operator.Validation.value.performance2"/>
<parameter key="ValidationPerformance3" value="operator.Validation.value.performance3"/>
</list>
<parameter key="sorting_type" value="none"/>
<parameter key="sorting_k" value="100"/>
<parameter key="persistent" value="false"/>
</operator>
<operator activated="true" class="x_validation" compatibility="6.0.008" expanded="true" height="112" name="Validation" width="90" x="380" y="120">
<description>A cross-validation evaluating a k-NN classification model.</description>
<parameter key="create_complete_model" value="false"/>
<parameter key="average_performances_only" value="true"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_validations" value="2"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="parallelize_training" value="false"/>
<parameter key="parallelize_testing" value="false"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="6.0.008" expanded="true" height="76" name="k-NN" width="90" x="45" y="30">
<parameter key="k" value="5"/>
<parameter key="weighted_vote" value="false"/>
<parameter key="measure_types" value="MixedMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_port="training" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="6.0.008" expanded="true" height="76" name="Apply_Model" width="90" x="45" y="30">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="6.0.008" expanded="true" height="76" name="Performance" width="90" x="179" y="30">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="free_memory" compatibility="6.0.008" expanded="true" height="76" name="Free Memory (3)" width="90" x="313" y="30"/>
<connect from_port="model" to_op="Apply_Model" to_port="model"/>
<connect from_port="test set" to_op="Apply_Model" to_port="unlabelled data"/>
<connect from_op="Apply_Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_op="Free Memory (3)" to_port="through 1"/>
<connect from_op="Free Memory (3)" from_port="through 1" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Forward Selection" to_port="example set"/>
<connect from_op="Forward Selection" from_port="attribute weights" to_port="result 1"/>
<connect from_op="Forward Selection" from_port="performance" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
...and my result, which I logged to "U:\temp\LoggingOfDemoProcessForwardselection.log" as defined in the process, looks like this:
# Generated by Log[com.rapidminer.datatable.SimpleDataTable]
# Performance NumberOfAttributes FeatureNames Accuracy ValidationPerformance ValidationPerformance1 ValidationPerformance2 ValidationPerformance3
0.96 1.0 a1 0.96 0.9533333333333334 0.9533333333333334 null null
0.6666666666666666 1.0 a2 0.6666666666666666 0.6399999999999999 0.6399999999999999 null null
0.5066666666666667 1.0 a3 0.5066666666666667 0.52 0.52 null null
0.92 1.0 a4 0.92 0.9333333333333333 0.9333333333333333 null null
0.96 2.0 a4, a1 0.96 0.9666666666666667 0.9666666666666667 null null
0.8933333333333333 2.0 a4, a2 0.8933333333333333 0.9066666666666667 0.9066666666666667 null null
0.96 2.0 a4, a3 0.96 0.9466666666666667 0.9466666666666667 null null
0.9333333333333333 3.0 a4, a3, a1 0.9333333333333333 0.9533333333333334 0.9533333333333334 null null
0.9866666666666667 3.0 a4, a3, a2 0.9866666666666667 0.9733333333333334 0.9733333333333334 null null
0.9733333333333334 4.0 a4, a3, a1, a2 0.9733333333333334 0.9666666666666667 0.9666666666666667 null null
I am trying to understand what is happening, so I included several performance values in the output.
Some things are not as I expected. For example, a1 is apparently the best single feature, yet a4 is chosen as the first feature. Could someone explain whether, and why, this output is correct?
Thanks in advance and best regards,
Answers
Connect the Log operator between the Validation operator and the output of the Forward Selection operator. At the moment the logging is done before the validation, so the logged values may be difficult to interpret.
When I run it, I get a4 as the best-performing attribute from the Validation step, followed by a4+a3 and then a4+a3+a1.
regards
Andrew
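For anyone reading later, here is a rough sketch of what the rewiring suggested above might look like inside the Forward Selection subprocess. It is a fragment, not a complete process: the operator parameters and the nested training/testing subprocesses of Validation are left out and would stay as in the original post. The port name "through 1" on the Log operator is an assumption based on how Log usually appears in exported processes, so please check it against your own export.

```xml
<!-- Sketch only: inner process of "Forward Selection" with Log placed after Validation.
     Assumption: the Log operator exposes pass-through ports named "through 1". -->
<process expanded="true">
  <operator activated="true" class="x_validation" compatibility="6.0.008" expanded="true" name="Validation">
    <!-- parameters and nested training/testing subprocesses as in the original process -->
  </operator>
  <operator activated="true" class="log" compatibility="6.0.008" expanded="true" name="Log">
    <!-- filename and log list as in the original process -->
  </operator>
  <!-- Validation runs first; its averaged performance passes through Log before leaving the subprocess -->
  <connect from_port="example set" to_op="Validation" to_port="training"/>
  <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
  <connect from_op="Log" from_port="through 1" to_port="performance"/>
  <portSpacing port="source_example set" spacing="0"/>
  <portSpacing port="sink_performance" spacing="0"/>
</process>
```

With this wiring, Log can only execute once Validation has produced its result, so each logged row should describe the feature set that was just evaluated.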