Feature Selection within CV: Which features are finally selected?
Dear All,
Coming back to a topic that has been discussed in the past but, as far as I'm concerned, never got a clear answer. Let's consider that we have 20 features A1, A2, A3, ..., A20 and we perform LASSO (optimizing lambda, with alpha = 1) with a logistic regression model, and we do so, following the suggested best practices to avoid accidental exposure of the labels, inside a K-fold CV operator. This means the model (and its feature selection) is built K + 1 times: K times, once on each fold's training data, and one more time on the total data set (in which case there is no split into training and test data). Now let's assume that the features with non-zero coefficients are different each time (A1, A3 and A5 for fold 1; A2, A3 and A20 for fold 2; ...; A5, A12 and A15 for the whole data set). Does the final model use the features that were selected on the total data set? If so, then this model's performance does not correspond to the output of the CV operator, which averages the performance across all folds. Is that correct?
Many thanks in advance,
Nikos
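To make the scenario concrete, here is a minimal sketch in Python/scikit-learn (purely for illustration, not the RapidMiner process itself; the synthetic data, feature names and the fixed regularization strength are invented for the example). It shows how the set of features with non-zero L1 coefficients can differ between each CV training fold and the complete data set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 20 features A1..A20 (invented for the example).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
feature_names = [f"A{i + 1}" for i in range(20)]

def selected_features(X_train, y_train, C=0.05):
    # Fit an L1-penalized (LASSO-style) logistic regression and return the
    # features that end up with non-zero coefficients.
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(penalty="l1", solver="liblinear", C=C))
    model.fit(X_train, y_train)
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    return [name for name, coef in zip(feature_names, coefs) if coef != 0.0]

# K fits, one per training fold, plus one extra fit on the complete data set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k, (train_idx, _) in enumerate(cv.split(X, y), start=1):
    print(f"fold {k}:", selected_features(X[train_idx], y[train_idx]))
print("full data:", selected_features(X, y))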
Best Answers
varunm1:

Hello @npapan69,
Yes, the final model is trained on the whole data set, and the feature selection is also done on the whole data set. The CV is there to check the model's performance across different subsets of the data. If you really want to test the final, fully trained model, you can set aside a hold-out data set and apply the model delivered at the cross-validation's output to check how it performs on that hold-out data.
Hope this helps.

Regards,
Varun
https://www.varunmandalapu.com/
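A minimal sketch of that idea (Python/scikit-learn, just to illustrate the pattern; the data, split size and C value are assumptions, not RapidMiner settings): cross-validation estimates future performance, the model is then trained on all remaining data, and an untouched hold-out set gives a final check of the fully trained model:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Put a hold-out partition aside before doing anything else.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty="l1", solver="liblinear", C=0.1))

# Cross-validation on the development data estimates how the approach performs.
print("CV accuracy:", cross_val_score(pipe, X_dev, y_dev, cv=10, scoring="accuracy").mean())

# The final model is trained on all development data and checked once on the hold-out set.
pipe.fit(X_dev, y_dev)
print("hold-out accuracy:", pipe.score(X_hold, y_hold))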
IngoRM:

Ok, let's try differently then (and sorry if I go way back to the basics on this one...).

Let's assume that you have a data set with 1,000 rows. You want to build an ML model on this data and put it into production. Naturally, you build the model on as much data as possible (all 1,000 rows), since then the model can learn from the most information, which in general is a good idea.

But not so fast: how can you be sure how well the model will perform on NEW data in the future? That's why we validate on test data (at least with a split validation, but if circumstances allow, you probably go with a cross-validation). Great, 90% accurate, let's put the model into production.

Please note that I did NOT say anything about optimizing the model here; the only purpose of the validation is to estimate the future performance of the ORIGINAL model built on all 1,000 rows.

For this reason, RapidMiner offers the "mod" (short for Model) port at the cross-validation operator, so that you can make sure that the validated model and the model built on the complete data use the same hyperparameters. This is just a convenience feature, since this is such a frequent pattern, but again it has nothing to do with any attempt to improve the model as part of the validation.

So now let's add feature selection to this discussion. The argument is the very same as above, but this time we replace "Model" with "Feature Selection and Model" in the flow:

Let's assume that you have a data set with 1,000 rows. You want to build an FS + ML model on this data and put it into production. Naturally, you build the FS + model on as much data as possible (all 1,000 rows), since then the FS + model can learn from the most information, which in general is a good idea.

But not so fast: how can you be sure how well the FS + model will perform on NEW data in the future? That's why we validate on test data (at least with a split validation, but if circumstances allow, you probably go with a cross-validation). Great, 90% accurate, let's put the FS + model into production.

You see? The feature selection becomes part of the model building. It is validated in exactly the same way as a model without FS, exactly for the reason you mentioned, since you would have information leakage if you did it beforehand. But after validation, and for the same reasons as for a model without FS, you also build the model AND the FS on the complete data for the production model in the end. Again, the whole point of the validation is to estimate how well it works, not to find good feature sets, parameters or anything else.

Hope this helps,
Ingo
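The same pattern, sketched in scikit-learn for illustration (not RapidMiner operators; the selector, the C value and the data are assumptions): the feature selection is part of the pipeline, so it is re-run inside every training fold during cross-validation, and the identical pipeline is then refit on the complete data set for the production model:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)

fs_plus_model = Pipeline([
    ("scale", StandardScaler()),
    # LASSO-style selection: keep only features with non-zero L1 coefficients.
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Validation: the selection is repeated inside each training fold, so the test folds
# never leak label information into the feature selection.
scores = cross_val_score(fs_plus_model, X, y, cv=10, scoring="accuracy")
print("estimated future accuracy:", scores.mean())

# Production model: the very same FS + model pipeline, rebuilt on all 1,000 rows.
fs_plus_model.fit(X, y)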
Answers
Ingo
Now I'm confused. Can you elaborate a bit more on the concept of using all the data to build a model, including the feature selection (for example LASSO)? What I mean is that doing the feature selection prior to, or outside, the CV operator leads to accidental leakage of the labels and therefore to over-optimistic performance; but what about doing the feature selection inside the CV if the final model is built using all the data? What happens then with leaking the labels into the feature selection if you use all the data and do not separate training and testing as you do for each fold?