Feature Selection
Hi everyone,
It is clear that feature selection should take place within the cross-validation operator; placing it outside and prior to the CV operator would leak the labels. My question is in regard to the fact that for each CV fold the features selected by mRMR, for example, may differ: which model is the one that I get on the output?
Thanks in advance
Best Answers
varunm1 Member Posts: 1,207 Unicorn
Hello @npapan69,
Placing feature selection inside the Cross Validation operator reduces bias, so the performance estimate generalizes better. Yes, as you mentioned, there might be 5 different models (in the case of 5 folds) built on 5 different feature sets, because feature selection runs inside the Cross Validation operator. The "mod" output of Cross Validation in RapidMiner gives you a model trained on the whole input dataset, which means the model you get might be different from all 5 models created during cross-validation.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hello, again!
Stop! Stop! Stop! Don't make that answer the right one! (My pride says "delete your answer", but my OCD says "leave it there").
TIL that the usual practice is to put feature selection inside the cross validation process, because otherwise it would lead to biased results. Thanks to @varunm1 for the several links he has sent me. I actually got confused (too many hours programming stuff, you know) but this article cleared it up for me: https://rapidminer.com/blog/learn-right-way-validate-models-part-4-accidental-contamination/.
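To make the bias concrete, here is a minimal Python sketch. It is not RapidMiner code; it only illustrates the idea. Everything in it is an assumption made for illustration: the data is pure noise, a simple mean-difference ranking stands in for mRMR, a nearest-centroid classifier stands in for the learner, and all function names are hypothetical.

```python
import random


def make_noise_data(n_rows, n_cols, seed):
    """Random features with labels that are independent of them."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [i % 2 for i in range(n_rows)]  # balanced, unrelated to X
    return X, y


def select_top_features(X, y, rows, k):
    """Rank features by |mean(class 1) - mean(class 0)| using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def predict(X, y, train_rows, features, row):
    """Nearest-centroid prediction for one row, restricted to `features`."""
    def squared_distance(label):
        members = [i for i in train_rows if y[i] == label]
        return sum(
            (X[row][j] - sum(X[i][j] for i in members) / len(members)) ** 2
            for j in features
        )

    return min((0, 1), key=squared_distance)


def cv_accuracy(X, y, n_folds, k, select_inside):
    rows = list(range(len(X)))
    # Selection outside CV: the ranking has already seen every label.
    global_features = select_top_features(X, y, rows, k)
    correct = 0
    for fold in range(n_folds):
        test_rows = rows[fold::n_folds]
        train_rows = [i for i in rows if i not in test_rows]
        if select_inside:
            # Selection inside CV: the ranking uses the training part only.
            features = select_top_features(X, y, train_rows, k)
        else:
            features = global_features
        correct += sum(
            predict(X, y, train_rows, features, i) == y[i] for i in test_rows
        )
    return correct / len(rows)


X, y = make_noise_data(n_rows=40, n_cols=1000, seed=0)
leaky = cv_accuracy(X, y, n_folds=5, k=10, select_inside=False)
honest = cv_accuracy(X, y, n_folds=5, k=10, select_inside=True)
print(f"selection outside CV: {leaky:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

Since the labels carry no information about the features, any accuracy well above 50% is an artifact. With selection done outside the folds, the estimate typically comes out far above chance; with selection inside each fold, it should stay near chance.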
Despite my lapse (and now understanding the question), I can focus on this:
"My question is in regard to the fact that for each CV fold the features selected by mRMR, for example, may differ: which model is the one that I get on the output?"
Let's see:
- On each cross validation fold, the selected features will differ.
- In the RapidMiner documentation for the Cross Validation operator, it says:
Also the number of iterations that will take place is the same as the number of folds. If the model output port is connected, the Training subprocess is repeated one more time with all Examples to build the final model.
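As a rough illustration of that behaviour, here is a minimal Python sketch of "one training run per fold, plus one more on all Examples". It is not how RapidMiner is implemented internally. The function names, the toy data, the mean-difference ranking (standing in for mRMR), and the nearest-centroid model are all assumptions made for the example.

```python
import random


def train_with_selection(X, y, rows, k):
    """Stand-in for the Training subprocess: pick k features, then fit."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def mean(members, j):
        return sum(X[i][j] for i in members) / len(members)

    def score(j):
        return abs(mean(pos, j) - mean(neg, j))

    ranked = sorted(range(len(X[0])), key=score, reverse=True)
    features = sorted(ranked[:k])
    centroids = {
        0: [mean(neg, j) for j in features],
        1: [mean(pos, j) for j in features],
    }
    return {"features": features, "centroids": centroids}


def predict(model, x):
    """Nearest-centroid prediction using only the model's own features."""
    def squared_distance(label):
        return sum(
            (x[j] - c) ** 2
            for j, c in zip(model["features"], model["centroids"][label])
        )

    return min((0, 1), key=squared_distance)


def cross_validate(X, y, n_folds, k):
    rows = list(range(len(X)))
    fold_models, correct = [], 0
    for fold in range(n_folds):
        test_rows = rows[fold::n_folds]
        train_rows = [i for i in rows if i not in test_rows]
        # One training run per fold; these models only feed the estimate.
        model = train_with_selection(X, y, train_rows, k)
        fold_models.append(model)
        correct += sum(predict(model, X[i]) == y[i] for i in test_rows)
    # One more run on all examples: this is the model that is returned.
    final_model = train_with_selection(X, y, rows, k)
    return correct / len(rows), fold_models, final_model


# Toy data: feature 0 carries signal, the other 29 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(60)]
for i in range(60):
    X[i][0] += 2.0 * y[i]

accuracy, fold_models, final_model = cross_validate(X, y, n_folds=5, k=3)
for number, model in enumerate(fold_models, start=1):
    print(f"fold {number} features: {model['features']}")
print(f"final model features: {final_model['features']}")
print(f"estimated accuracy: {accuracy:.2f}")
```

The fold models may each pick a somewhat different feature set, and none of them is the delivered model. The accuracy describes the whole procedure (selection plus training), while the final model comes from running that same procedure once more on the full dataset.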
All the best,
Rodrigo.
Answers
Answers below:
No, feature selection should be done before the cross validation process, not inside the cross validation process. What you are trying to accomplish will lead to certain example subsets having different columns, and a model that is both unpredictable and poorly trained.
Again, would you mind sharing your XML so we can see what is happening?
All the best,
Rodrigo.