"Trying to understand Feature Selection"
I'm rather new to this whole Machine Learning thing and to RapidMiner specifically, and I'm having a bit of trouble understanding how Feature Selection works. I was wondering if a more experienced RM user would be willing to help me out.
My input is a list of 120 vectors containing 200 features each and tagged with one of 4 classes. Classification performance with Naive Bayes and 10-fold CV is 87.50%.
In an effort to improve this score further, I tried applying (backward) Feature Selection to the vectors first. This improved my score to 92.50%, which made me happy.
I then wanted to find out which features had been selected exactly to see if it would tell me anything about my data, so I added an AttributeWeightsWriter to my process. The full process looks like this:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.5">
<operator name="Root" class="Process" expanded="yes">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<operator name="ExampleSource" class="ExampleSource">
<parameter key="attributes" value="...\200.aml"/>
<parameter key="sample_ratio" value="1.0"/>
<parameter key="sample_size" value="-1"/>
<parameter key="permutate" value="false"/>
<parameter key="decimal_point_character" value="."/>
<parameter key="column_separators" value=",\s*|;\s*|\s+"/>
<parameter key="use_comment_characters" value="true"/>
<parameter key="comment_chars" value="#"/>
<parameter key="use_quotes" value="true"/>
<parameter key="quote_character" value="""/>
<parameter key="quoting_escape_character" value="\"/>
<parameter key="trim_lines" value="false"/>
<parameter key="skip_error_lines" value="false"/>
<parameter key="datamanagement" value="int_sparse_array"/>
<parameter key="local_random_seed" value="-1"/>
</operator>
<operator name="FS" class="FeatureSelection" expanded="yes">
<parameter key="normalize_weights" value="true"/>
<parameter key="local_random_seed" value="-1"/>
<parameter key="show_stop_dialog" value="false"/>
<parameter key="user_result_individual_selection" value="false"/>
<parameter key="show_population_plotter" value="false"/>
<parameter key="plot_generations" value="10"/>
<parameter key="constraint_draw_range" value="false"/>
<parameter key="draw_dominated_points" value="true"/>
<parameter key="maximal_fitness" value="Infinity"/>
<parameter key="selection_direction" value="backward"/>
<parameter key="keep_best" value="1"/>
<parameter key="generations_without_improval" value="1"/>
<parameter key="maximum_number_of_generations" value="-1"/>
<operator name="FSChain" class="OperatorChain" expanded="yes">
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="keep_example_set" value="false"/>
<parameter key="create_complete_model" value="false"/>
<parameter key="average_performances_only" value="true"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_validations" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="local_random_seed" value="-1"/>
<operator name="KernelNaiveBayes" class="KernelNaiveBayes">
<parameter key="keep_example_set" value="false"/>
<parameter key="laplace_correction" value="true"/>
<parameter key="estimation_mode" value="greedy"/>
<parameter key="bandwidth_selection" value="heuristic"/>
<parameter key="bandwidth" value="0.1"/>
<parameter key="minimum_bandwidth" value="0.1"/>
<parameter key="number_of_kernels" value="10"/>
<parameter key="use_application_grid" value="false"/>
<parameter key="application_grid_size" value="200"/>
</operator>
<operator name="ApplierChain" class="OperatorChain" expanded="yes">
<operator name="Applier" class="ModelApplier">
<parameter key="keep_model" value="false"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="false"/>
</operator>
<operator name="Evaluator" class="Performance">
<parameter key="keep_example_set" value="false"/>
<parameter key="use_example_weights" value="true"/>
</operator>
</operator>
</operator>
<operator name="ProcessLog" class="ProcessLog">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
<parameter key="sorting_type" value="none"/>
<parameter key="sorting_k" value="100"/>
<parameter key="persistent" value="false"/>
</operator>
</operator>
</operator>
<operator name="AttributeWeightsWriter" class="AttributeWeightsWriter">
<parameter key="attribute_weights_file" value="...\200.wgt"/>
</operator>
</operator>
</process>
And this is the part where I'm stumped: once the process finishes running and I examine the weights in the performance screen or in the .wgt file, I notice only ONE feature gets a weight of 0 while ALL others remain at 1. This still seems to give me the score of 92.50% I mentioned before.
But when I remove that one feature from my vectors prior to classification (either manually or by using AttributeWeightsLoader > AttributeWeightsApplier), I only get a score of 87.50%, which is my score without Feature Selection. So what's going on here? FS is obviously doing much more than just turning off one single feature. How do I find out which features it's been using so I can reproduce the results?
Thanks for your help.
Answers
Could you please post the resulting ProcessLog here? It would help me a lot...
Greetings,
Sebastian
Looking at the log, it seems the process achieves the 92.5% score after only one generation. If FS turns off only one attribute per generation, then getting only one attribute with weight 0 makes sense. What I don't understand, then, is that when I load the weights (with the disabled feature) and run the classification task again with the improved feature set, I don't get the score of 92.50% I was expecting...
Unfortunately, my guess might be a little unsatisfying for you: it smells like a bug in the FeatureSelection... Could you do me the favor of logging the XValidation's performance instead of the FeatureSelection's performance? This would be quite helpful, because it won't always return the same best performance.
Greetings,
Sebastian
I had to put the text in pastebin because otherwise my post would exceed the allowed character limit. Just follow this link:
http://pastebin.com/m7a713e0f
The XML file looks like this now:
That was not exactly what I wanted. Instead, I need the performance values of each single application of the XValidation. The XValidation is run each time an attribute is removed, so the performance should change each time. Here's the modified process, doing exactly what I want to know:
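(A minimal sketch of that change, assuming XValidation exposes its performance under the same operator.<name>.value pattern the original ProcessLog already uses for FS; only the ProcessLog inside FSChain needs to change:)
<operator name="ProcessLog" class="ProcessLog">
    <list key="log">
        <!-- log every single XValidation run instead of FS's best-so-far value -->
        <parameter key="generation" value="operator.FS.value.generation"/>
        <parameter key="performance" value="operator.XValidation.value.performance"/>
    </list>
    <parameter key="persistent" value="false"/>
</operator>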
Greetings,
Sebastian
I see that the algorithm reaches its highest score in generation 1, so it would make sense that only one feature is disabled. However, I can't reproduce these results manually. Maybe I'm trying to reproduce them in the wrong way? What I tried is this:
- Run the above experiment, which saves the weights to a file with AttributeWeightsWriter.
- Create a new project that loads the examples and applies the weight file to them (a sketch of such a process follows below).
The resulting score is not the same as what's being reported by the FS algorithm, though: it's 87.5% (the original score before FS) instead of 92.5%...
Is this the correct way of writing and applying the feature selection weights?
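For reference, a minimal sketch of what such a loader/applier process might look like, assuming AttributeWeightsLoader takes the same attribute_weights_file parameter as the AttributeWeightsWriter above and that AttributeWeightsApplier deselects the zero-weighted attributes; the truncated paths mirror the original post:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.5">
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="...\200.aml"/>
        </operator>
        <!-- load the weights written by the FS run (one attribute at 0, the rest at 1) -->
        <operator name="AttributeWeightsLoader" class="AttributeWeightsLoader">
            <parameter key="attribute_weights_file" value="...\200.wgt"/>
        </operator>
        <!-- remove the zero-weighted attribute(s) from the example set -->
        <operator name="AttributeWeightsApplier" class="AttributeWeightsApplier"/>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="10"/>
            <parameter key="sampling_type" value="stratified sampling"/>
            <operator name="KernelNaiveBayes" class="KernelNaiveBayes"/>
            <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                <operator name="Applier" class="ModelApplier"/>
                <operator name="Evaluator" class="Performance"/>
            </operator>
        </operator>
    </operator>
</process>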
The performance estimated by a cross-validation depends not only on the data used, the algorithm, and its parameters, but also on how the data is split into folds. It might be that the random splitting of examples differs in a way that makes one result better than the other. To avoid this, you will have to use global random seeds, so that the operator always gets the same sequence of random numbers.
It is not a good idea to use this in a general setting, because during parameter optimization you could overfit your parameters to this one single sequence of random numbers, but you could use it for testing here.
Another point is that the feature selection stops after the second generation because the performance did not increase. To continue anyway, you could increase the "generations_without_improval" parameter (see the sketch below).
Greetings,
Sebastian
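For illustration, a sketch of those two changes against the FS operator in the original process; the seed value 2001 is arbitrary, and treating a fixed local_random_seed on XValidation as a way to pin the fold split (my reading of the advice above) is an assumption:
<operator name="FS" class="FeatureSelection" expanded="yes">
    <parameter key="selection_direction" value="backward"/>
    <!-- keep removing attributes for a few more generations even without improvement -->
    <parameter key="generations_without_improval" value="5"/>
    <operator name="FSChain" class="OperatorChain" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
            <!-- fixed seed, so every evaluation (and any later re-run) uses the same folds -->
            <parameter key="local_random_seed" value="2001"/>
            <!-- remaining parameters and inner operators as in the original process -->
        </operator>
    </operator>
</operator>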