Order of Performing nested K-fold cross validation
thomas_gadd7
Member Posts: 1 Learner I
I have been looking at the following tutorial on correct model validation:
I'm looking at the section on contamination through feature selection when doing k-fold cross-validation. In the section on accidental contamination, near the bottom in example 3), it suggests using nested k-fold validation to search for features, similar to what example 2) suggests for the choice of hyperparameters.
My question is: is there any best practice on whether to do the nested k-fold validation for feature selection first and then use the selected features for the nested validation of the hyperparameters, or vice versa? I imagine it would be too computationally expensive to nest all three techniques within one another.
Can anyone advise on this?
Thank you
Answers
That's a pretty great question. I would also like to see an example of a proper multi-level nested validation process for the case where all steps are needed at once:
@mschmitz ?
Vladimir
http://whatthefraud.wtf
In practice, I don't think many people are putting parameter optimization inside cross-validation. It's just too time-consuming. I'd be quite comfortable with a setup where normalization and feature selection occurred within cross-validation, and then the results of that process were fed to an optimization process where cross-validation for model training was occurring inside the parameter optimization operator.
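To make that concrete, here is a minimal sketch in scikit-learn terms (RapidMiner itself is visual, so these are stand-in operators; the breast-cancer dataset, SelectKBest with k=10, the SVC model, and the C grid are all illustrative assumptions, not anything from the tutorial):

```python
# Sketch only: sklearn stand-ins for the RapidMiner operators discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# Step 1: normalization and feature selection INSIDE cross-validation.
# Each fold refits the scaler and selector on its own training split,
# so the performance estimate is not contaminated by the test folds.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", SVC()),
])
est = cross_val_score(pipe, X, y, cv=10).mean()
print("honest performance estimate: %.3f" % est)

# Step 2: feed the results of that process (scaler + selected features,
# refit on the full dataset) into a parameter optimization that runs its
# own cross-validation inside the optimizer.
X_scaled = StandardScaler().fit_transform(X)
X_sel = SelectKBest(f_classif, k=10).fit_transform(X_scaled, y)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=10)
grid.fit(X_sel, y)
print("best parameters:", grid.best_params_)
```

Note the deliberate compromise in step 2: the selector has already seen all the data before the optimizer's inner cross-validation runs, which is exactly the time-saving trade-off described above rather than a fully nested three-level process.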
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
This is a great question, and I remember we had this discussion elsewhere in the threads here. I agree with what @Telcontar120 says.
Thanks @Telcontar120 @Thomas_Ott
Though I have one really stupid question at this point, as I am a bit dumb today.
If we normalize or perform feature selection within k-fold x-Validation, this is done k+1 times in total, if I remember correctly from Martin's explanation somewhere else: k times (once for each fold) plus one more time on the full dataset, right? At the same time, logic tells me that on each fold we might get a slightly different normalization or feature selection?
If so, how do we pull the preprocessing model out of x-Validation in this case? Just by taking the latest one? My concern is that the same preprocessing model should also be applied to the test set and propagated to the production process (if there is any).
Vladimir
http://whatthefraud.wtf
Correct, with k-fold cross-validation there are k+1 runs, where the final run is on the entire dataset, and that run produces the model that is returned. But conceptually the cross-validation is simply a way to estimate the reliability of your results on unseen data (to avoid overfitting), and as Ingo's post has shown, when you do things like normalization and other preprocessing inside the cross-validation, you get a more realistic view of what your eventual performance would be like. But when you actually go to construct your normalization model or other preprocessing models, that should be done using the entire dataset.
Feature selection is similar, except that no preprocessing model is returned, just a smaller set of attributes that will be used in the final model. And of course the predictive model itself is returned directly from the cross-validation output (once again, the one built on the entire dataset, not any of the individual k folds).
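In sklearn-style terms (the same illustrative stand-ins as in the sketch above, not RapidMiner itself), the distinction between the k estimation runs and the final "k+1-th" run looks like this:

```python
# Sketch, assuming the same illustrative stand-ins as the earlier example.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                # preprocessing model
    ("select", SelectKBest(f_classif, k=10)),   # feature selection
    ("model", SVC()),                           # predictive model
])

# The k folds: each refits the pipeline on k-1 folds and tests on the
# held-out fold. These runs exist ONLY to estimate unseen-data performance.
print("estimated accuracy:", cross_val_score(pipe, X, y, cv=10).mean())

# The (k+1)-th run: refit everything on the ENTIRE dataset. These fitted
# objects (scaler, selector, model) are what you deploy and then apply,
# frozen, to any future test or production data.
final = pipe.fit(X, y)
```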
I hope this clarifies!
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thanks @Telcontar120, you have restored my sanity. This is the way I actually do it; I just decided to double-check because it's a dumb day, that's why.
Vladimir
http://whatthefraud.wtf
@kypexin it's a complex rabbit hole, but it's exactly what @Telcontar120 said when it comes to k+1: the final run is on the entire dataset, with the performances averaged across the k folds.
When I used to teach the RM training course, this topic (e.g. normalizing inside the X-val) would cause my students' heads to smoke.