Cross Validation or Split Validation
Arupriya_Sen
Member Posts: 21 Contributor II
Which operator is better to use, Cross Validation or Split Validation, in RapidMiner?
Best Answers
IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

What sounds like an easy question does not have an easy answer. Many will say "go with cross validation by all means", but I see this in a more differentiated way. Below are some thoughts, but if you want a more in-depth discussion, I suggest checking out this white paper: https://rapidminer.com/resource/correct-model-validation/

There are really three important things when it comes to validation (the third one covers your specific question):
- Never ever calculate training errors. They are misleading and not useful for model validation at all. Check out the white paper for some easy examples of this. I assume you know this, but I point it out again mainly because of point (2) below.
- Treat preprocessing as part of the model building. Many data preprocessing steps (e.g. normalization, feature engineering, encoding, and many more) take the complete data set into account for their calculations and aim at improving the model. Because of that, those operations effectively become part of the model building. The same is true for parameter optimization and all other model-optimizing techniques. Following rule (1) above, you have to properly validate those techniques on a test set as well; otherwise your model will look better in validation than it will actually perform in production. This topic is more complex than it sounds, and I highly recommend checking out the white paper for more information (see also the code sketch after this list).
- Pick the validation scheme most appropriate for your situation. We have established that you need an independent test set and that you need to validate (most of) your preprocessing. This leads to the problem of picking a good scheme out of the many existing ones: split validation, multiple hold-out sets, cross validation, leave-one-out, bootstrapping, etc. In practice, I recommend the following three techniques depending on your situation:
- Split validation: use this if your data is very large, training times are long, your time is limited, your preprocessing is very complex with multiple nested steps, or the use case is not so critical that you cannot accept some uncertainty about the robustness of your models.
- Cross validation: use this if you want the most thoroughly tested models, your data is small, your processes are simple enough to embed in one or more nested cross validations, total runtime is not an issue for you, or the use case is life-or-death important.
- Split validation with a robust multiple hold-out set validation: a good compromise between the two approaches. It delivers estimation quality similar to that of cross validation without the drawback of long runtimes, and it also works with more complex process designs. The quality of the validation is good enough for most business applications.
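To make points (1) and (2) concrete outside of RapidMiner's visual processes, here is a minimal sketch in Python (assuming scikit-learn; the data set and learner are arbitrary placeholders, not anything from the original discussion). The preprocessing lives inside a pipeline, so it is re-fit on each training fold and never sees the corresponding test fold, and the inflated training accuracy is printed next to the honest cross-validated estimate:

```python
# Illustration of points (1) and (2), sketched with scikit-learn (not RapidMiner):
# the scaler sits inside a Pipeline, so it is re-fit on each training fold and
# never sees the test fold -- no information leakage from preprocessing.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# (1) Training error is misleading: an unconstrained tree memorizes the data.
model.fit(X, y)
print("training accuracy:", model.score(X, y))  # close to 1.0, far too optimistic

# Proper estimate: 10-fold cross validation of the *whole* pipeline.
scores = cross_val_score(model, X, y, cv=10)
print("cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The unconstrained tree reaches nearly perfect training accuracy while the cross-validated estimate is noticeably lower; that gap is exactly what point (1) warns about, and wrapping the preprocessing in the pipeline is what point (2) demands.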
One more word on technique (3) above: the idea is that you split off a validation set of some decent size (typically around 40% of the data) and then perform all preprocessing (where necessary) and model building on the training part (the other 60% or so) as usual. For the actual validation, you then perform a multiple non-overlapping hold-out set validation on the 40% you left out before, skipping the outliers among the hold-out runs. That means you build, let's say, 7 hold-out sets out of the 40% validation set, calculate the error rates on those 7 sets, skip the ones with the highest and lowest error rates, and finally compute the average and standard deviation of the rest. This gives you some of the main advantages of cross validation, such as reducing the dependency on a specific hold-out sample and getting a standard deviation as an indicator of model robustness. But since scoring is typically fast and training is slow, you also get the biggest advantage of split validation: shorter runtimes. As mentioned before, this technique also deals better with complex process designs.

This third technique is much less known than the other two, but it delivers about the same quality as cross validation with significantly lower runtimes. In our tests, the differences between the estimates from this technique and from cross validation were not statistically significant on 19 out of 20 data sets. For all those reasons, we also use this technique in Auto Model for the model validations.
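As an illustration of this scheme, here is another minimal sketch (again assuming scikit-learn and NumPy; the 60/40 split, the 7 hold-out sets, and the outlier trimming follow the description above, while the data set and model are placeholders):

```python
# Sketch of the multiple hold-out validation described above: train once on
# 60% of the data, split the remaining 40% into 7 disjoint hold-out sets,
# score each, drop the best and worst runs (the lowest and highest error
# rates), and report mean and standard deviation of the rest.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# Preprocessing and model building happen on the training part only.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_train, y_train)

# Score on 7 non-overlapping hold-out sets; scoring is cheap, training ran once.
scores = [model.score(X_chunk, y_chunk)
          for X_chunk, y_chunk in zip(np.array_split(X_val, 7),
                                      np.array_split(y_val, 7))]

# Skip the outliers: drop the highest and lowest scores, average the rest.
trimmed = sorted(scores)[1:-1]
print("accuracy: %.3f +/- %.3f" % (np.mean(trimmed), np.std(trimmed)))
```

Note that the single 60/40 split on its own is plain split validation; the trimmed multiple hold-out scoring on top is what adds the robustness estimate at almost no extra cost.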
Final comment: I hope the guidance about when to use which validation technique is helpful, but personally I think this question is even less important than being thorough and validating EVERYTHING correctly (point (2) above), including the preprocessing. If you get this wrong, the difference between the error rates you see in the validated results and what you actually get in production can easily be 10x larger than the differences caused by the different validation schemes.

Hope this helps,
Ingo
MarlaBot Employee-RapidMiner, Member Posts: 57 Community Manager

Hi @Arupriya_Sen - this is MarlaBot. I found these great videos on our RapidMiner Academy that you may find helpful:
MarlaBot