Feature Selection within CV: Which features are finally selected?
Dear All,
Coming back to a topic that has been discussed in the past but, as far as I'm concerned, never got a clear answer. Let's consider that we have 20 features A1, A2, A3, ..., A20 and we perform LASSO (optimizing lambda, with alpha = 1) with a logistic regression model, and we do so, following the suggested best practices to avoid accidental exposure of the labels, inside a K-fold CV operator. This means the model (and its feature selection) is built K + 1 times: K times, once on each fold's training data, and one more time on the total data set (in which case there is no split into training and test data). Now let's assume that the features with non-zero coefficients are different each time (A1, A3 and A5 for fold 1; A2, A3 and A20 for fold 2; ...; A5, A12 and A15 for the whole data set). Does the final model use the features that were selected on the total data set? If so, then this model's performance does not correspond to the output of the CV operator, which averages the performance across all folds. Is that correct?
Many thanks in advance,
Nikos
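To make the scenario concrete, here is a minimal sketch in Python/scikit-learn (purely for illustration, not the RapidMiner process itself; the synthetic data, feature names and the fixed regularization strength are invented for the example). It shows how the set of features with non-zero L1 coefficients can differ between each CV training fold and the complete data set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 20 features A1..A20 (invented for the example).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
feature_names = [f"A{i + 1}" for i in range(20)]

def selected_features(X_train, y_train, C=0.05):
    # Fit an L1-penalized (LASSO-style) logistic regression and return the
    # features that end up with non-zero coefficients.
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(penalty="l1", solver="liblinear", C=C))
    model.fit(X_train, y_train)
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    return [name for name, coef in zip(feature_names, coefs) if coef != 0.0]

# K fits, one per training fold, plus one extra fit on the complete data set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k, (train_idx, _) in enumerate(cv.split(X, y), start=1):
    print(f"fold {k}:", selected_features(X[train_idx], y[train_idx]))
print("full data:", selected_features(X, y))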
Best Answers
varunm1:

Hello @npapan69,
Yes, the final model is trained on the whole data set, and the feature selection is also done on the whole data set. The CV is there to check the model's performance across different subsets of the data. If you really want to test the final, fully trained model, you can set aside a hold-out data set and apply the model delivered at the cross-validation's output to check how it performs on that hold-out data.
Hope this helps.

Regards,
Varun
https://www.varunmandalapu.com/
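A minimal sketch of that idea (Python/scikit-learn, just to illustrate the pattern; the data, split size and C value are assumptions, not RapidMiner settings): cross-validation estimates future performance, the model is then trained on all remaining data, and an untouched hold-out set gives a final check of the fully trained model:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Put a hold-out partition aside before doing anything else.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty="l1", solver="liblinear", C=0.1))

# Cross-validation on the development data estimates how the approach performs.
print("CV accuracy:", cross_val_score(pipe, X_dev, y_dev, cv=10, scoring="accuracy").mean())

# The final model is trained on all development data and checked once on the hold-out set.
pipe.fit(X_dev, y_dev)
print("hold-out accuracy:", pipe.score(X_hold, y_hold))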
IngoRM:

Ok, let's try differently then (and sorry if I go way back to the basics on this one...).

Let's assume that you have a data set with 1,000 rows. You want to build an ML model on this data and put it into production. Naturally, you build the model on as much data as possible (all 1,000 rows), since then the model can learn from the most information, which in general is a good idea.

But not so fast: how can you be sure how well the model will perform on NEW data in the future? That's why we validate on test data (at least with a split validation, but if circumstances allow, you probably go with a cross-validation). Great, 90% accurate, let's put the model into production.

Please note that I did NOT say anything about optimizing the model here; the only purpose of the validation is to estimate the future performance of the ORIGINAL model built on all 1,000 rows.

For this reason, RapidMiner offers the "mod" (short for Model) port at the cross-validation operator, so that you can make sure that the validated model and the model built on the complete data use the same hyperparameters. This is just a convenience feature, since this is such a frequent pattern, but again it has nothing to do with any attempt to improve the model as part of the validation.

So now let's add feature selection to this discussion. The argument is the very same as above, but this time we replace "Model" with "Feature Selection and Model" in the flow:

Let's assume that you have a data set with 1,000 rows. You want to build an FS + ML model on this data and put it into production. Naturally, you build the FS + model on as much data as possible (all 1,000 rows), since then the FS + model can learn from the most information, which in general is a good idea.

But not so fast: how can you be sure how well the FS + model will perform on NEW data in the future? That's why we validate on test data (at least with a split validation, but if circumstances allow, you probably go with a cross-validation). Great, 90% accurate, let's put the FS + model into production.

You see? The feature selection becomes part of the model building. It is validated in exactly the same way as a model without FS, exactly for the reason you mentioned, since you would have information leakage if you did it beforehand. But after validation, and for the same reasons as for a model without FS, you also build the model AND the FS on the complete data for the production model in the end. Again, the whole point of the validation is to estimate how well it works, not to find good feature sets, parameters or anything else.

Hope this helps,
Ingo
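The same pattern, sketched in scikit-learn for illustration (not RapidMiner operators; the selector, the C value and the data are assumptions): the feature selection is part of the pipeline, so it is re-run inside every training fold during cross-validation, and the identical pipeline is then refit on the complete data set for the production model:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)

fs_plus_model = Pipeline([
    ("scale", StandardScaler()),
    # LASSO-style selection: keep only features with non-zero L1 coefficients.
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Validation: the selection is repeated inside each training fold, so the test folds
# never leak label information into the feature selection.
scores = cross_val_score(fs_plus_model, X, y, cv=10, scoring="accuracy")
print("estimated future accuracy:", scores.mean())

# Production model: the very same FS + model pipeline, rebuilt on all 1,000 rows.
fs_plus_model.fit(X, y)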
Answers
Ingo
Now I'm confused. Can you elaborate a bit more on the concept of using all the data to build a model, including the feature selection (for example LASSO)? What I mean is that doing the feature selection prior to, or outside, the CV operator leads to accidental leakage of the labels and therefore to over-optimistic performance; but what about doing the feature selection inside the CV if the final model is built using all the data? What happens then with leaking the labels into the feature selection if you use all the data and do not separate training and testing as you do for each fold?