Which Validation operator should be used for model evaluation?
The accuracy reported by the Performance Vector differs between Split Validation and Cross Validation, with Cross Validation showing a slight improvement. Which validation operator is preferred for model evaluation?
Best Answer
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
There are big differences in how the Split and Cross Validation operators work, but the intent is the same: train, test, and measure the performance of a model. The Cross Validation operator gives a more honest estimate of how the model would perform on unseen data. This is why in the accuracy measure for a CV model you might see 70.00% (+/- 5%). The +/- 5% is essentially one standard deviation around the average 70% accuracy.
Go check out Ingo's paper on model validation to learn more: https://rapidminer.com/resource/correct-model-validation/
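If it helps to see the idea in code, here is a rough scikit-learn analogue (an illustrative sketch, not the actual RapidMiner operators): a single hold-out split yields one accuracy number, while 10-fold cross validation yields an average accuracy plus a spread across folds, which is where the +/- comes from.

```python
# Rough scikit-learn analogue of the RapidMiner setup (not RapidMiner code):
# compare a single hold-out split with 10-fold cross validation on Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# "Split Validation": one train/test partition, one accuracy number.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# "Cross Validation": 10 folds, so we get an average and a spread.
fold_scores = cross_val_score(model, X, y, cv=10)
print(f"Split validation accuracy: {split_acc:.3f}")
print(f"Cross validation accuracy: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```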
Answers
Hi Thomas,
I read the article and made a simple process using the Iris data to address the parameter optimization bias.
Just wanted to check if I've done the nesting for the two validations correctly.
Could you please let me know?
Thank You,
Dhruve
hi @khannadh - I saw your note but mainly wanted to tag @Thomas_Ott so that it gets his attention
So your question is a good one. Your setup was almost correct except that you need to specify the name map parameters in the Set Parameters operator:
Scott
I would be very curious what others think on this very important issue, as setups have varied over the years. @Telcontar120? @mschmitz? @yyhuang? @Pavithra_Rao?
Scott
Thanks Scott.
I appreciate the help.
The Set Parameters operator is the only step that I don't understand.
What is the operator doing exactly in that step?
@sgenzer
Also, when I set the parameters according to your screenshot, I still get a warning sign, so I'm not sure the problem is fixed.
I've attached a screenshot.
Do you know why this is happening?
@khannadh at run time with larger datasets, this setup could become slow. I would just put the Cross Validation operator inside the Optimize Parameters operator instead of the other way around. That way, the 10 folds become one parameter optimization iteration.
I made some port connections and it seems to have removed the problem.
But I'd still like to understand what exactly is going on.
I have attached the screenshot and process.
If someone could explain, that would be great.
Thank You,
Dhruve
I tend to agree with @Thomas_Ott here. While I understand the theoretical arguments (at least on some level) in favor of the double-nesting (cross validation inside Optimize Parameters inside cross validation), I don't find that in practice there is a significant difference or advantage to this setup. And as Tom says, it can lead to significantly longer run times with larger data sets. I'll also point out that the double-nesting approach is not used in RapidMiner's Auto Model processes either.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
ok thanks @Thomas_Ott @Telcontar120 - that was my feeling as well, but I appreciate the confirmation. So @khannadh, just to be crystal clear: the approach shown in that whitepaper is the "gold standard" but is rarely used in practice due to the issues pointed out above.
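For reference, that gold-standard double-nesting looks roughly like this in scikit-learn terms (an illustrative sketch with made-up parameter values, not the RapidMiner process from the whitepaper): parameter optimization runs inside an inner cross validation, and the whole search is wrapped in an outer cross validation, so the reported accuracy never comes from data that was used for tuning.

```python
# Sketch of the double-nested ("gold standard") setup in scikit-learn terms.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 4]}

# Inner loop: parameter optimization with cross validation inside it.
inner_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                            param_grid, cv=5)

# Outer loop: another cross validation wrapped around the entire search.
# Note the cost: 10 outer folds x 5 inner folds x 12 parameter combinations.
outer_scores = cross_val_score(inner_search, X, y, cv=10)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} (+/- {outer_scores.std():.3f})")
```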
Now to answer your questions...
- The "Set Parameters" literally takes the input parameters on the left (the gray "par" nub) and pushes them into the parameters for another operator in your process by its name. In your process, the name of the operator to which you want to push those parameters is called "Decision Tree (2)", and your Set Parameters operator is, in your process, named "Set Parameters". Hence, in the name map, I put "Set Parameters" in the left side (under "set operator name") and "Decision Tree (2)" on the right side (under "operator name"). That's what that operator does.
- Now, as @Telcontar120 and @Thomas_Ott implied, none of us really does this. To be honest, that's the first time I have used "Set Parameters" in a very long time (and I'm on RapidMiner every day). The more "normal" and much simpler way to do this (and the way I think we all do it) is simply to put Cross Validation inside "Optimize Parameters (Grid)". Done.
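In code terms, that simpler nesting is roughly what a single grid search with built-in cross validation does. The sketch below (again scikit-learn as a stand-in for the RapidMiner operators, with illustrative parameter values) also shows a rough analogue of what "Set Parameters" accomplishes: pushing the winning parameters into another model by name.

```python
# Sketch of the simpler setup: cross validation nested inside the
# parameter optimization, which is what GridSearchCV does in one step.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 4]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=10)
search.fit(X, y)

# Rough analogue of "Set Parameters": push the winning parameters into
# another (fresh) estimator by name, then retrain it on all of the data.
final_tree = DecisionTreeClassifier(random_state=42).set_params(**search.best_params_)
final_tree.fit(X, y)
print("Best parameters:", search.best_params_)
```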
The only other thing that many of us do, to make sure the performance is a true measure, is to do an initial split of the data to ensure that you are measuring performance against an unseen "testing" set. Like this:
I usually do a 70/30 split but this often depends on who's doing it, and what the data set is like.
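In scikit-learn terms (again just an illustrative sketch, with a hypothetical 70/30 split and parameter grid), that layout looks something like this: hold out 30% up front, tune with cross validation on the remaining 70%, and only score the untouched 30% at the very end.

```python
# Sketch: hold-out split first, then parameter optimization with cross
# validation on the training portion, then one final score on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=10)
search.fit(X_train, y_train)                 # tuning only ever sees the 70%

holdout_acc = search.score(X_test, y_test)   # true measure on unseen data
print(f"Hold-out accuracy: {holdout_acc:.3f}")
```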
Good luck!
Scott