How to plot Stability and/or Accuracy versus number of features?
Hi all,
I would like to plot the stability of a feature selection operator as a function of the number of features (I would like to reproduce Fig. 6 of the attached .pdf, which I believe is useful for the community). For instance, I can use the "Feature Selection Stability Validation" operator that comes with the Feature Selection Extension, and inside it I could use any other feature selection operator, e.g. "MRMR-FS" or "SVM-RFE". I would then like to plot the stability of the feature selection against the number of features. I believe this would give me a better feeling for how many features to keep for further processing and modelling.
The same idea could be used to plot any performance metric, runtime, etc. against the number of features: a sort of "learning curve", but over the number of features instead of the number of examples.
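To make the idea concrete, here is a rough sketch of the kind of curve I mean, in Python/scikit-learn rather than in RapidMiner (the univariate filter and the pairwise Jaccard similarity are only stand-ins for whichever FS operator and stability measure one actually uses):
```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=800, n_features=200, n_informative=20, random_state=0)
rng = np.random.RandomState(0)

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

ks = [5, 10, 20, 50, 100]
stability, accuracy = [], []
for k in ks:
    # Stability: average pairwise Jaccard similarity of the top-k sets over bootstrap resamples.
    subsets = []
    for _ in range(10):
        idx = rng.choice(len(y), size=len(y), replace=True)
        sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        subsets.append(np.flatnonzero(sel.get_support()))
    stability.append(np.mean([jaccard(a, b) for a, b in combinations(subsets, 2)]))
    # "Learning curve over features": cross-validated accuracy using only the top-k features.
    sel = SelectKBest(f_classif, k=k).fit(X, y)
    accuracy.append(cross_val_score(GaussianNB(), sel.transform(X), y, cv=5).mean())

for k, s, a in zip(ks, stability, accuracy):
    print(f"k={k:4d}  stability={s:.2f}  accuracy={a:.2f}")
```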
I hope the question is clear enough and I thank you all for your input.
Merci,
Amaury
Best Answer
IngoRM (RapidMiner Founder)
Hi Amaury,
"There you used the Sonar data set and a Naive Bayes (NB) classifier. From some basic tests, I see that the results for the Pareto front will depend on which classifier is used inside the Validation operator."
That is correct, and I actually think this is something positive: the feature weighting / importance, and the question of whether a feature should be used at all, is then a good fit to the model itself, which typically leads to better accuracies. This is called the "wrapper approach", by the way. If you filter attributes out without taking the specific model into account, we call this the "filter approach". The wrapper approach generally delivers better results but needs longer runtimes for model building and validation.
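As a minimal sketch of the two approaches in Python/scikit-learn (a univariate filter vs. RFE wrapped around the actual model; these are stand-ins, not the RapidMiner operators):
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=10, random_state=0)

# Filter approach: rank features with a model-independent score, then train the model.
filter_pipe = make_pipeline(SelectKBest(f_classif, k=10), LinearSVC(dual=False))

# Wrapper approach: the model itself decides which features survive (here via RFE).
wrapper_pipe = make_pipeline(RFE(LinearSVC(dual=False), n_features_to_select=10),
                             LinearSVC(dual=False))

print("filter :", cross_val_score(filter_pipe, X, y, cv=5).mean())
print("wrapper:", cross_val_score(wrapper_pipe, X, y, cv=5).mean())
```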
"My problem consists of around 800 examples and 2000 attributes. I have built a process where I use a "Select Subprocess" operator, and inside it I have different "Optimize Grid" operators containing different classifiers (e.g., LogReg, Random Forest, SVM). After this long run, I compare the ROCs of the different classifiers obtained with the best set of parameters found by the "Optimize Grid" operators."
That makes sense. You could in theory wrap the whole model building and validation process into the MO feature selection, but this might run for a long time. An alternative is to do the model selection and parameter optimization on all features beforehand and then use only the best model found so far inside the MO feature selection. Or you could first filter some features out (filter approach), then optimize the model / parameters, and then run the MO FS. There is really no right or wrong here in my opinion. I personally use an iterative approach most of the time: filter some features out, find some good model candidates, optimize parameters a little bit, run a feature selection, then optimize parameters further, and so on...
Hope this helps,
Ingo
Answers
Hello @meloamaury - I'm tagging @mschmitz and @IngoRM in hopes they may be able to help.
Scott
Hi @meloamaury,
Although I am sure you could build such a process, I would like to recommend an alternative approach: did you consider generating such a plot with a multi-objective feature selection? The big advantage is that you do not run into local extrema while adding features; the feature compositions can (and actually will) change for different feature set sizes. To be honest, I find this much more useful in most practical applications.
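Just to illustrate the Pareto idea itself (not the evolutionary search the extension uses), here is a toy sketch in Python on synthetic data sized roughly like Sonar: evaluate many random feature subsets and keep the ones that are non-dominated in (fewer features, higher accuracy):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data roughly the size of Sonar (208 examples, 60 attributes).
X, y = make_classification(n_samples=208, n_features=60, n_informative=10, random_state=0)
rng = np.random.RandomState(0)

candidates = []
for _ in range(200):
    mask = rng.rand(X.shape[1]) < rng.uniform(0.05, 0.5)   # random feature subset
    if mask.sum() == 0:
        continue
    acc = cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()
    candidates.append((int(mask.sum()), acc))

# Pareto front: keep a point unless some other point uses no more features with strictly
# higher accuracy, or strictly fewer features with at least the same accuracy.
front = [c for c in candidates
         if not any((o[0] <= c[0] and o[1] > c[1]) or (o[0] < c[0] and o[1] >= c[1])
                    for o in candidates)]
for n_feat, acc in sorted(set(front)):
    print(f"{n_feat:3d} features -> accuracy {acc:.3f}")
```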
If you are interested, this blog post might be for you:
https://rapidminer.com/multi-objective-optimization-feature-selection/
There will also be a webinar on this topic on Jan 24th, which will be announced here soon:
https://rapidminer.com/resources/events-webinars/
Cheers,
Ingo
I knew I tagged the right people. Thanks, @IngoRM.
Scott
Hi @meloamaury,
Doesn't the FS extension provide performance measures for stability, e.g. the Jaccard index? I did this with that extension in my PhD thesis. You basically combine an Optimize + FS with this performance operator and you are done.
Let me know if it works. I am not at my work computer, so I can't provide a process yet.
Best,
Martin
Dortmund, Germany
Hi @mschmitz,
Thanks for your input. Yes, there is the "Performance (MRMR)" operator. However, from what I understood of the .pdf attached to my message, if we use the "Feature Selection Stability Validation" operator with any FS operator, say "MRMR-FS", inside it, it already gives us an averaged stability, but as a single value, not as a curve over the number of attributes. If you could please send me the process you mention, I might be able to understand your suggestion better.
Ingo's suggestion on multi-objective feature selection is interesting, although I am still not sure how strongly it will depend on the classifier used to select the features.
Hi @meloamaury,
Attached is an example process. I think this is how Ben, the author, thought it should be used.
Best,
Martin
Dortmund, Germany
Hi @IngoRM,
I had not actually thought about that, and of course it makes sense. I read your blog post and it is very interesting (I very much regret not having been active in this community from the beginning). I have a question regarding your multi-objective feature selection process.
There you used the Sonar data set and a Naive Bayes (NB) classifier. From some basic tests, I see that the results for the Pareto front will depend on which classifier is used inside the Validation operator. And also, each of these classifiers has its own set of parameters that one should tune with an "Optimize Grid" operator.
My problem consists of around 800 examples and 2000 attributes. I have built a process where I use a "Select Subprocess" operator, and inside it I have different "Optimize Grid" operators containing different classifiers (e.g., LogReg, Random Forest, SVM). After this long run, I compare the ROCs of the different classifiers obtained with the best set of parameters found by the "Optimize Grid" operators.
But before all this, I do some crude feature selection with "MRMR-FS", where I choose a fixed number of attributes to pass to the "Select Subprocess". It is in this step that I would like to use a robust approach like the one you suggested. That is where I am concerned: the multi-objective feature selection will already depend on the classifier and on its parameters, which I only find after the feature selection is done.
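For reference, here is a condensed sketch of this setup in Python/scikit-learn; the parameter grids, the k=100 filter step, and the synthetic data are placeholders, not my actual values:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the data: ~800 examples, ~2000 attributes.
X, y = make_classification(n_samples=800, n_features=2000, n_informative=30, random_state=0)

classifiers = {
    "LogReg":       (LogisticRegression(max_iter=1000),      {"clf__C": [0.1, 1, 10]}),
    "RandomForest": (RandomForestClassifier(random_state=0), {"clf__n_estimators": [100, 300]}),
    "SVM":          (SVC(),                                  {"clf__C": [0.1, 1, 10]}),
}

for name, (clf, grid) in classifiers.items():
    pipe = Pipeline([("fs", SelectKBest(f_classif, k=100)),  # crude filter step (stand-in for MRMR-FS)
                     ("clf", clf)])
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5)
    search.fit(X, y)
    print(f"{name:12s}  AUC={search.best_score_:.3f}  params={search.best_params_}")
```
One side note on the sketch: because the filter step sits inside the Pipeline, it is re-fitted on each cross-validation training fold, which avoids the selection bias of doing the feature selection once on the full data.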
Could you please let me know what you think?
Thanks very much indeed!
Amaury
Hi @IngoRM,
Thanks a lot for your answers and suggestions, very much appreciated. I will try different schemes and see how they perform; the run time is indeed going up considerably, but I think it is still manageable.
Merci!
Amaury
Hi @IngoRM,
Sorry to disturb you again, but you mention that:
"I personally use an iterative approach most of the time: filter some features out, find some good model candidates, optimize parameters a little bit, run a feature selection, then optimize parameters further, and so on..."
Do you have a RapidMiner process that does this iterative approach automatically? I am very curious to see how you would build such a process. If you have one and can share it with me, I will try to modify it for my own problem.
Thanks very much in advance!
Amaury
No, unfortunately not. I am sure one could build such a process, but I actually prefer to see the intermediate results and base some detailed decisions on them, which then shape the next steps of the iterative process. That's why I keep the parts separated. I also often find that you need to make quite a lot of adaptations to the data prep in each phase of the process, since the data is always somewhat different :-)
On a related note: I will do a meetup in NYC next week on the topic of multi-objective feature selection. Details are here: https://www.meetup.com/RapidMiner-Data-Science-Machine-Learning-MeetUp-New-York/events/245644467/
I also will do a webinar on the same topic on January 24th 2018. Details are here: https://rapidminer.com/resource/webinar-better-machine-learning-models-multi-objective-optimization/
Cheers,
Ingo
Thanks a lot for the info. Yes, I am registered for the webinar.
Best,
Amaury