Auto Model feedback: Debate about model training
lionelderkrikor
RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Dear all,
I would like, in a friendly and humble spirit, to open a debate about the training method of the models in RapidMiner's Auto Model.
Indeed, from what I understand of standard data science methodology, after evaluating and selecting the "best" model, that model has to be (re)trained on the whole initial dataset before going into production.
This principle is also applied by the Split Validation operator: the model delivered by RapidMiner is trained on the whole input dataset (independently of the split ratio).
BUT this is not the case in Auto Model: the model(s) made available by RapidMiner's Auto Model are trained on only 60% of the input dataset.
My first question is: is it always relevant to (re)train the selected model on the whole input dataset?
If yes, and if it is feasible, it may be a good idea to implement this principle in Auto Model. (I am thinking of users (non-data-scientists / beginners) who do not want to ask questions and just want a model to put into production...)
But perhaps, because of a computation-time constraint (or another technical reason), it is not feasible to (re)train all the models on the whole initial dataset?
In that case (not feasible), it might be a good idea to advise users in Auto Model (in the documentation, and/or in the overview of the results, and/or in the "model" menus of the different models) to (re)train the model manually, by generating the process of the selected model, before it goes into production...
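To make the idea concrete, here is a minimal sketch of the principle in scikit-learn terms (the dataset, the learner and the 60/40 ratio are only assumptions for illustration, not Auto Model internals):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# 60/40 hold-out split, assumed here to mirror Auto Model's ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42, stratify=y)

# This model exists only to estimate the performance in production
eval_model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
print("estimated accuracy:", accuracy_score(y_test, eval_model.predict(X_test)))

# Before production, retrain with identical settings on the WHOLE dataset
production_model = GradientBoostingClassifier(random_state=42).fit(X, y)
```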
To conclude, I hope I have helped advance the debate, and I look forward to your opinions on these topics.
Have a nice day,
Regards,
Lionel
Comments
Thanks for starting this discussion; I do have a question regarding it.
Thanks
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Scott
Ingo
Personally, I prefer to use the model retrained on the entire dataset in production, or at least to have that option.
We do validation in the first place to understand the likely performance of a model in production on unseen data, not because it is inherently better to use a model trained on a subset of the data. It's analogous to why we don't return one of the individual training models from the cross-validation operator, but rather a model trained on the full dataset.
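In scikit-learn terms, a minimal sketch of that logic might look like this (the learner and the fold count are arbitrary choices on my part):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# The ten fold models exist only to estimate performance on unseen data...
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# ...none of them is returned; the deliverable is trained on everything
final_model = DecisionTreeClassifier(random_state=0).fit(X, y)
```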
For smaller datasets, this can indeed make a difference. For larger datasets, I agree that it should likely converge to a very similar model regardless. But even in those cases, I would be more inclined to go back and take a random subset (sized based on the overall sample size), go through all the steps on that random subset (including feature engineering and feature selection as well as model parameter estimation), and compare the results. If the behavior really did change significantly from my earlier output, I would then have concerns about model robustness and stability.
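A rough sketch of that kind of stability check, again under assumed choices of subset size and learner:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Draw a random subset (half the rows here, purely as an assumption)
rng = np.random.default_rng(7)
idx = rng.choice(len(X), size=len(X) // 2, replace=False)

# In practice, rerun ALL steps (feature engineering, selection, tuning)
full_model = DecisionTreeClassifier(random_state=0).fit(X, y)
subset_model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

# Large disagreement would raise concerns about robustness and stability
agreement = (full_model.predict(X) == subset_model.predict(X)).mean()
print("prediction agreement: %.1f%%" % (100 * agreement))
```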
It's also why it would be much better to utilize cross validation in Auto Model rather than split validation, because then you would not have the problem you are posing in the first place (a different model reported in the Auto Model results vs. the production model trained on the full data). If you used cross validation, this difference would go away.
I know there are some other reasons why you preferred split validation in Auto Model, but this is one unhappy consequence of that decision. It also runs contrary to the point we make in training (and that you have made in numerous other contexts as well) that cross-validation is the best approach to model validation, the so-called "gold standard" of data science; using split validation as the basis for Auto Model undercuts that message.
Just my $0.02 since you asked :-)
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
...is actually not the case (more below). The problem of potential user confusion would be the same. In fact, my belief that many users (and please keep in mind that many / most users have much less experience than you and I do) will be confused comes precisely from the fact that many people ask things like "which model is produced by cross validation?".
I have highlighted one particular branch in the model. If I now check the data set with all predictions, I get the following (sorted by gender and age):
If you compare the highlighted data points with the highlighted branch in the model, the predictions are "wrong". Of course we understand why that is the case, so that is not my point / not the problem I want to solve.
Ingo
Thanks for your explanation. Coming to the modeling part, one idea is to assess the size of the dataset based on the number of samples and dimensions, and to trigger either cross validation or split validation in the backend accordingly. One of my concerns regarding small datasets is the 40 percent test share: the model's dynamics can change a lot if that 40 percent of the data is later added back for training at the end (in case we need to model on the whole dataset). If possible, why can't we adopt cross validation or split validation based on the size of the data? Deciding on the size threshold is not an easy task either, but the possibility needs to be tested.
What would this do?
The advantages are that cross validation on a small dataset seems to be more stable, reduces the algorithm's tendency to overestimate performance, and lowers the impact of the model dynamics changing when the model is retrained on the whole dataset. In the case of huge data, as you mentioned earlier, the algorithm converges once it has seen a certain amount of data, so a split can be appropriate there.
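Just to sketch the dispatch idea in scikit-learn terms (the 10,000-row threshold below is an arbitrary assumption of mine, not a tested cutoff):

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

def estimate_performance(model, X, y, small_data_threshold=10_000):
    """Cross-validate small datasets; use a hold-out split for large ones."""
    if len(X) < small_data_threshold:
        # 10-fold cross validation: a stabler estimate on little data
        return cross_val_score(model, X, y, cv=10).mean()
    # 60/40 hold-out split: cheaper and adequate once data is plentiful
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, stratify=y)
    return accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
```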
Coming to the user experience:
We could provide users with an option to check what kind of validation was triggered, so that an experienced user can inspect the more technical side of the model building if they want to.
Your idea of showing two models might increase the run time and, as you mentioned, might confuse most of the novice users.
These are just my thoughts (my 0.02 INR).
Thank you
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
So, having re-focused the issue in the way you have described, I concur that the best solution is probably to present two models and their associated output in Auto Model: one for validation purposes and one for production purposes.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Ingo