Interpretation of labeled data after cross-validation
Dear all,
I am having trouble interpreting the exported labeled data of the Cross Validation operator. Nested inside it is either a regression model or a neural network model (we are trying to compare their performance).
Using this method (through the third output port of the Cross Validation operator, the test result set), I get an output of the actual and the predicted value for every row in the dataset.
Are these predictions generated iteratively during the folds (and thus each based on a different model), or are they the result of the best-performing model being run on the entire set?
I hope you can clarify this, and also that it has not been answered many times already. I did search but could not find this in the forums.
Thanks a lot in advance.
Best Answer
-
MartinLiebig (Administrator, RM Data Scientist)
Hi,
it is this:
"Are these predictions being iteratively generated during the folds (and thus each based on a different model)?"
The other options are not possible. Keep in mind that Cross Validation does not return "the best" fold model as a result, but the model built on the full data set. You cannot apply that model to the same data to get honest predictions, and you also cannot apply "the best" fold model to the full data, because part of that data would have been in its training set.
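For readers who prefer to see the distinction in code, here is a minimal sketch in scikit-learn terms. This is only an analogy, not RapidMiner's own API; the linear model and synthetic dataset are placeholder assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Placeholder data standing in for the ExampleSet.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each row's prediction comes from the fold model that did NOT see that row.
# This mirrors the labeled test output of cross-validation: k different
# models contribute, one per fold.
y_pred = cross_val_predict(LinearRegression(), X, y, cv=10)

# The deliverable model, by contrast, is retrained on the full dataset.
final_model = LinearRegression().fit(X, y)
```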
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
Dear Martin,
That already clarifies a lot. However, I still have trouble understanding what the output of the model port is. You describe it as the model that is built on the full dataset, which is also what the documentation states. But does this mean that after the 10 folds (used for calculating the average performance) it does another iteration of training on the full dataset and testing on the (same) full dataset?
Also thank you very much for replying so quickly.
Yes, that is correct. Cross Validation will iterate over 10 randomly partitioned subsets (folds) of the data (if k = 10) and then do a full training pass on the entire dataset, delivering that model to the MOD port.
However, for clarity: the model output is trained on the full dataset, but the reported performance is not measured on that full model; rather, it is the average performance across the k folds of the cross-validation. There would be no way to train a model on the full dataset and also report its performance on a separate test sample, since there would be no records left to hold out.
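A minimal sketch of that reporting logic, again in scikit-learn terms rather than RapidMiner's operator API (the model and data are placeholder assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the ExampleSet.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The reported performance is the average across the k fold models...
fold_scores = cross_val_score(LinearRegression(), X, y, cv=10,
                              scoring="neg_root_mean_squared_error")
print("estimated RMSE:", -fold_scores.mean())

# ...while the model that is actually delivered is fit on all rows,
# so its own training data leaves nothing held out to test it on.
final_model = LinearRegression().fit(X, y)
```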
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi Thomas & Brian,
Now it is completely clear to me, or at least what is being output is. I almost always have trouble judging the 'validity' of implementing a certain step, but that is a question for another time.
Really appreciated the help!