Cross Validation # of k's
Hey guys!
Just wondering, are there any guidelines on how many folds (k) to use when doing cross-validation?
Let's say I have 100k data points. How many folds would be considered enough?
Answers
The default setting (k = 10) has been the consensus for a long time.
Depending on your data and the stability of your models, you could get away with fewer folds or need more.
Try different values of k and watch both the main performance number and its reported variance. If these stay stable as k changes, you have enough data and stable enough models, so you can go with fewer iterations.
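Outside of RapidMiner, the same check can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the synthetic data, the toy nearest-centroid model, and the helper names (`make_data`, `fit_centroids`, `cross_validate`) are all made up for this example and are not part of any library. The idea is simply to run k-fold cross-validation for several values of k and compare the mean accuracy and its spread across folds.

```python
import random
import statistics

def make_data(n, seed=0):
    """Synthetic 1-D, two-class data (an assumption, for illustration only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 0 is centred at 0.0 and class 1 at 1.5, both with unit spread.
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data

def fit_centroids(train):
    """Toy model: the mean x value of each class."""
    return {
        label: statistics.fmean(x for x, y in train if y == label)
        for label in (0, 1)
    }

def predict(means, x):
    """Predict the class whose centroid is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def cross_validate(data, k, seed=0):
    """Return the accuracy measured on each of the k held-out folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k roughly equal parts
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        means = fit_centroids(train)
        hits = sum(predict(means, x) == y for x, y in test)
        scores.append(hits / len(test))
    return scores

data = make_data(2000)
results = {}
for k in (3, 5, 10, 20):
    scores = cross_validate(data, k)
    results[k] = (statistics.fmean(scores), statistics.stdev(scores))
    print(f"k={k:2d}  mean accuracy={results[k][0]:.3f}  "
          f"std across folds={results[k][1]:.3f}")
```

If the mean barely moves between k values and the spread stays at a level you can accept, a smaller k is probably fine for that data and model.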
I agree with @BalazsBarany that 10 folds is the default consensus, but with large datasets you can usually get away with 5. As noted, stability of the performance is the key measure. If you have a small dataset, you might consider the leave-one-out option, but for larger datasets it is not recommended at all, since it trains one model per example.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
I agree, and if you need a reference for that, see: Ron Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137-1143, August 20-25, 1995, Montreal, Quebec, Canada.
The choice of k is an example of the bias-variance trade-off present in every estimation.
Leave-one-out CV is the least biased option, but it can have very high variance (the models are trained on datasets that differ by only one point, so they are highly correlated).
CV with smaller values of k tends to be more biased (pessimistic: each model is trained on less data, so the error is overestimated) but has lower variance.
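To make the bias side concrete for the 100k example from the question, here is a small sketch of how much data each model actually trains on for a given k. The helper name `training_set_size` is hypothetical, and the calculation assumes folds of roughly equal size.

```python
def training_set_size(n, k):
    """Smallest training set across folds when n rows are split into k parts."""
    largest_test_fold = -(-n // k)  # ceiling division
    return n - largest_test_fold

n = 100_000
for k in (5, 10, 20, n):  # k == n is leave-one-out
    size = training_set_size(n, k)
    print(f"k={k:>6}: each model trains on {size} rows, {k} models are fitted")
```

With 5 folds each model sees 80,000 rows, with 10 folds 90,000, and with leave-one-out 99,999, at the cost of fitting 100,000 models.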
In practical terms, if the estimate of model performance is very important, you can run several CVs with k ranging from 5 to 20 and then choose the one with the highest variance you still find acceptable. If the estimate is not very important (e.g. it is used only for feature selection or parameter optimization), you can leave k at 10, or reduce it to 5 if you need it to run fast.