Optimize Parameters inside X-Validation: does it make sense?
hi,
I noticed in the sample processes under -> Template -> Churn Modeling that there is an Optimize Parameters operator inside an X-Validation.
I know there is literature saying that for X-Validation the validation data is used for parameter tuning. Is this how it is meant to be done, as in that example?
I am just curious, because it does not make much sense to me to do a parameter optimization inside an X-Validation. The dataset is split, and inside Optimize Parameters the model that was built on the training data is then tested on that same training data, which should result in overfitting. The parameters are optimized for the training data, not for the real data, and only for the subset of the data used inside the X-Validation. The best parameter set is then retrieved and used inside the X-Validation to apply the model to the test data. As a result you get a different model with different parameters on each run inside the X-Validation, probably depending on how the dataset is split. What one tries to get, however, is one general model (since you will probably only have one model at the end, not ten different ones) that fits the real data best.
It seems to work for that template, but I am a bit sceptical whether that is a valid / good approach for modeling, because the parameters should be optimized for the test set, i.e. for the real use case, and not tuned to overfit the training data...
What do you think? Is this valid? Or are both approaches valid and good modeling?
I personally would put the X-Validation inside the Grid Optimization operator...
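In scikit-learn terms (just to illustrate what I mean; the learner and parameter grid are placeholders, not anything from the Churn Modeling template), "X-Validation inside the Grid Optimization operator" would look roughly like this:

```python
# Minimal sketch (scikit-learn, illustrative only): the cross-validation
# runs inside the parameter search, so every candidate parameter set is
# scored on held-out folds rather than on its own training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=10,  # 10-fold X-Validation inside the grid optimization
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```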
Best Answer
IngoRM (RM Founder):
Well, in the best of all worlds you would actually use 2 cross validations: an inner and an outer one.
Inner cross validation: uses a specific parameter setting / feature set / ...(whatever you optimize)... and evaluates the performance of the machine learning algorithm with those parameters / using those features / ... This performance is used by the optimization method to evaluate the fitness of what you optimize.
Outer cross validation: this is wrapped around the optimization method. In some sense, the optimization method is a machine learning algorithm on its own. So even if you optimize the parameters as described above with an inner cross validation, you still overfit to the data set you use, and even to the specific cross validation split (if you use a local random seed, which will always lead to the same data partitions). The outer cross validation measures the effect of this optimization overfitting.
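A rough sketch of this nested setup in scikit-learn terms (the learner, parameter grid, and fold counts are just assumptions for illustration): the inner cross validation lives inside the parameter search, and the outer cross validation wraps the whole search to measure the optimization overfitting.

```python
# Nested cross validation sketch (scikit-learn, illustrative only):
# the inner CV scores each parameter candidate, the outer CV scores
# the whole "optimize parameters" procedure on folds it never tuned on.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

inner_search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,  # inner cross validation
)

# Outer cross validation: the search (including its inner CV) is re-run
# on each outer training fold and evaluated on the matching outer test fold.
outer_scores = cross_val_score(inner_search, X, y, cv=10)
print(outer_scores.mean(), outer_scores.std())
```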
In practice, this approach often takes too long, and the effect of parameters and features on overfitting is often a bit less drastic than that of the machine learning model itself (this is not always true, of course). So here is what I most often do: I keep back a small validation set (10% or so of the complete labeled data I have) and only go with the inner cross validation inside of Optimize Parameters / Features / ... After this is done, I measure the performance of the optimized model on this small validation set, just as a sanity check to see if the performance is still in the same ballpark or completely off. Again, this is not optimal, but it is a good balance between what is feasible in practice and getting sufficiently robust models.
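Again in scikit-learn terms, a minimal sketch of that pragmatic setup (the 10% hold-out fraction, learner, and parameter grid are assumptions for illustration, not a prescription):

```python
# Pragmatic sketch (illustrative): hold back ~10% as a validation set,
# run only the inner cross validation inside the parameter optimization,
# then sanity-check the optimized model on the held-back data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

# Keep back ~10% of the labeled data as a final sanity-check set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42
)

search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=10,  # the single (inner) cross validation used for tuning
)
search.fit(X_train, y_train)

# Sanity check: is the hold-out performance in the same ballpark
# as the cross-validated estimate, or completely off?
print("CV estimate:", search.best_score_)
print("Hold-out   :", search.score(X_val, y_val))
```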
Best,
Ingo