Cross-Validation and Grid-Search Optimization
Hi all,
I was wondering if I could get some clarification on how to correctly nest the operators and set parameters when using grid-search optimization within n-fold cross-validation. I assume the optimization operator is nested within the cross-validation to select parameters, as described in this article: https://rapidminer.com/learn-right-way-validate-models-part-4-accidental-contamination/
How is the Set Parameters operator used to correctly set the parameter in question for the model that is applied to the data after the optimization has been performed?
Any clarification on these processes would be helpful,
Thank you
Answers
The guys at RapidMiner made a series of pretty awesome tutorial videos. This one here should answer your question: 16 Optimization of the Model Parameters
Thank you for this video.
Just to confirm, if the cross-validation process is nested within a parameter optimization process, will the parameters be optimized for each iteration of cross-validation? My concern is that placing cross-validation inside of the optimize process will optimize on the entire data set rather than separately for each fold, resulting in contamination of the training and testing data.
RapidMiner executes everything that is nested within the "Optimize Parameters" operator for each possible combination of parameters (as you define them). Hence, if you have a list of e.g. 121 possible combinations of parameters, RM will run 121 k-fold cross-validations, one for each parameter combination. Therefore, it is important not to try too many combinations at once, otherwise your process can run for a very long time, depending on the size and structure of your data.
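As a rough illustration outside of RapidMiner, here is a hypothetical scikit-learn sketch of that behaviour: one full k-fold cross-validation is run per parameter combination. The dataset, learner, and parameter ranges below are placeholders chosen only to reproduce the "121 combinations" example, not anything from the original process.

```python
# Hypothetical sketch of what "Optimize Parameters (Grid)" does conceptually,
# written with scikit-learn instead of RapidMiner operators.
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

C_values = [10.0 ** i for i in range(-5, 6)]      # 11 values
gamma_values = [10.0 ** i for i in range(-5, 6)]  # 11 values -> 121 combinations

results = {}
for C, gamma in product(C_values, gamma_values):
    # one full 10-fold cross-validation per parameter combination (121 CVs in total)
    scores = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=10)
    results[(C, gamma)] = scores.mean()

best = max(results, key=results.get)
print("best (C, gamma):", best, "mean accuracy:", results[best])
```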
You are right about that: if you optimize the parameters on the same dataset that you will later train the model on, you will introduce a bias.
Edit: You actually need an independent validation dataset to estimate the model error correctly. It is OK to optimize the parameters on the training set.
Here is an article on the matter:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1397873/
Best,
Sebastian
Thank you all for these responses.
To clarify, is it correct that to optimize parameters without bias, the optimization process must be nested within the cross-validation? Not cross-validation nested within optimization?
Thank you all
The less biased estimate would be the following:
an outer CV with Optimize Parameters on its training side
+
a CV inside the Optimize Parameters operator
That could take quite a while depending on the model. Sometimes you can do without the outer CV, because the absolute performance of a model is rarely useful (it is more important to compare different models, or to use problem-specific measures such as cost savings).
You can also speed the process up by using fewer folds and/or by using the Optimize Parameters (Evolutionary) operator.
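For what it's worth, a hypothetical scikit-learn analogue of that nested setup (not the actual RapidMiner process; the dataset and parameter grid are placeholders) looks like this:

```python
# Outer CV estimates the error of the whole "tune, then train" procedure;
# the inner CV (inside the grid search) only ever sees the outer training fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# inner CV: parameter optimization on the training fold only
inner_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)

# outer CV: unbiased estimate of the tuned model's performance
outer_scores = cross_val_score(inner_search, X, y, cv=10)
print("estimated accuracy:", outer_scores.mean())
```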
Thank you for this clarification, SGolbert. In using this approach, is it possible to feed the model parameters identified by the optimization process directly into the testing portion of CV?
Thank you
Hi lsevel,
I have looked into your question a bit and noticed that it is not very well described in the documentation. In short, you have two ways of doing it:
1. Inside the Optimize Parameters operator, deliver the model from the mod port of the learner (let's say an SVM) to the res port. Then, outside Optimize Parameters (but inside the outer CV), deliver that res port to the testing side.
2. Use the Set Parameters operator. The operator help provides sufficient guidance on this one; basically, you need an extra model operator to pass the parameter set to.
I personally find the first solution much simpler, but it's kind of counterintuitive at first, because the documentation says nothing about what you get out of the result ports of Optimize Parameters. After some testing, I found out that you get whatever result the best model delivered.
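If it helps, here is a hypothetical scikit-learn analogue of the two options (not RapidMiner code; the learner and grid are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, y_train = load_breast_cancer(return_X_y=True)  # placeholder training fold

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Option 1: take the tuned model itself
# (analogous to wiring the learner's mod port to the res port of Optimize Parameters)
best_model = search.best_estimator_

# Option 2: take only the winning parameter set and apply it to a fresh learner
# (analogous to Set Parameters feeding a second model operator)
fresh_model = SVC(**search.best_params_).fit(X_train, y_train)
```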
Best,
Sebastian
Hi Sebastian,
Thank you for this reply. I've pasted code based on the directions you gave (Option 1). Could you confirm that this is the correct organization?
Additionally, regarding the unbiased optimization estimate, I am wondering about the optimization of parameters for each training set. If there is an inner cross-validation within the Optimize Parameters operator, wouldn't the selected parameters be based on a subset of the training data? As a result, wouldn't the optimized parameters be selected on a subset of the training data and not be optimized for the whole training set within each fold?
Thank you
(Small bump)
It's probably better to tag @SGolbert than to bump the thread.
Hi lsevel,
The process seems correct. About the second part: the cross-validation operator returns a model trained on all of its input data (in this case, one of the outer training folds). The optimized parameters are selected using the whole training fold, more precisely by averaging the performance over different subsets of that set (i.e., in the inner CV).
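A hypothetical scikit-learn sketch of that behaviour (placeholder data and grid): the parameters are chosen by averaging scores over the inner-CV subsets, but the returned model is then refit on the entire training fold with the winning parameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# stand-in for one training fold of the outer cross-validation
X_fold, y_fold = load_breast_cancer(return_X_y=True)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5, refit=True)
search.fit(X_fold, y_fold)

# parameter choice: mean score over the inner-CV subsets of the training fold
print(search.cv_results_["mean_test_score"])

# final model: refit with the winning parameters on the entire training fold
final_model = search.best_estimator_
```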
Best,
Sebastian