Auto Model feedback: Debate about model training
lionelderkrikor
RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Dear all,
I would like, in a friendly and humble spirit, to open a debate about the training method of the models in RapidMiner's Auto Model.
Indeed, from what I understand of standard data science methodology, after evaluating and selecting the "best" model, that model has to be (re)trained on the whole initial dataset before going into production.
This principle is also applied by the Split Validation operator: the model delivered by RapidMiner is trained on the whole input dataset (independently of the split ratio).
BUT this is not the case in Auto Model: the model(s) made available by RapidMiner's Auto Model are trained on only 60% of the input dataset.
My first question is: is it always relevant to (re)train the selected model on the whole input dataset?
If yes, and if it is feasible, it may be a good idea to implement this principle in Auto Model. (I am thinking of users (non-data-scientists / beginners) who do not want to ask questions and just want a model to put into production...)
But perhaps, because of a computation-time constraint (or another technical reason), it is not feasible to (re)train all the models on the whole initial dataset?
In that case (not feasible), it might be a good idea to advise users in Auto Model (in the documentation, and/or in the overview of the results, and/or in the "model" menus of the different models) to (re)train the model manually, by generating the process of the selected model, before it goes into production...
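To make the idea concrete, here is a minimal sketch of the principle in scikit-learn terms (the dataset, the learner and the 60/40 ratio are only assumptions for illustration, not Auto Model internals):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# 60/40 hold-out split, assumed here to mirror Auto Model's ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42, stratify=y)

# This model exists only to estimate the performance in production
eval_model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
print("estimated accuracy:", accuracy_score(y_test, eval_model.predict(X_test)))

# Before production, retrain with identical settings on the WHOLE dataset
production_model = GradientBoostingClassifier(random_state=42).fit(X, y)
```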
To conclude, I hope I have helped advance the debate, and I look forward to your opinions on these topics.
Have a nice day,
Regards,
Lionel
Comments
Thanks for starting this discussion; I do have a question regarding it.
Thanks
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Scott
Ingo
Personally, I prefer to use the model retrained on the entire dataset in production, or at least to have that option.
We do validation in the first place to understand the likely performance of a model in production on unseen data, not because it is inherently better to use a model trained on a subset of the data. It's analogous to why we don't return one of the individual training models from the cross-validation operator, but rather a model trained on the full dataset.
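In scikit-learn terms, a minimal sketch of that logic might look like this (the learner and the fold count are arbitrary choices on my part):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# The ten fold models exist only to estimate performance on unseen data...
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# ...none of them is returned; the deliverable is trained on everything
final_model = DecisionTreeClassifier(random_state=0).fit(X, y)
```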
For smaller datasets, this can indeed make a difference. For larger datasets, I agree that it should likely converge to a very similar model regardless. But even in those cases, I would be more inclined to go back and take a random subset (sized based on the overall sample size), go through all the steps on that random subset (including feature engineering and feature selection as well as model parameter estimation), and compare the results. If the behavior really did change significantly from my earlier output, I would then have concerns about model robustness and stability.
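A rough sketch of that kind of stability check, again under assumed choices of subset size and learner:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Draw a random subset (half the rows here, purely as an assumption)
rng = np.random.default_rng(7)
idx = rng.choice(len(X), size=len(X) // 2, replace=False)

# In practice, rerun ALL steps (feature engineering, selection, tuning)
full_model = DecisionTreeClassifier(random_state=0).fit(X, y)
subset_model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

# Large disagreement would raise concerns about robustness and stability
agreement = (full_model.predict(X) == subset_model.predict(X)).mean()
print("prediction agreement: %.1f%%" % (100 * agreement))
```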
It's also why it would be much better to utilize cross validation in Auto Model rather than split validation, because then you would not have the problem you are posing in the first place (a different model reported in the Auto Model results vs. the production model trained on the full data). If you used cross validation, this difference would go away.
I know there are some other reasons why you preferred split validation in Auto Model, but this is one unhappy consequence of that decision. It also runs contrary to the point we make in training (and that you have made in numerous other contexts as well) that cross-validation is the best approach to model validation, the so-called "gold standard" of data science; using split validation as the basis for Auto Model undercuts that message.
Just my $0.02 since you asked :-)
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
...is actually not the case (more below). The problem of potential user confusion would be the same. In fact, my belief that many users (and please keep in mind that many / most users have much less experience than you and I do) will be confused comes precisely from the fact that many people ask things like "which model is produced by cross validation?".
I have highlighted one particular branch in the model. If I now check the data set with all predictions, I get the following (sorted by gender and age):
If you compare the highlighted data points with the highlighted branch in the model, the predictions are "wrong". Of course we understand why that is the case, so that is not my point / not the problem I want to solve.
Ingo
Thanks for your explanation. Coming to the modeling part, one idea is to assess the size of the dataset based on the number of samples and dimensions, and to trigger either cross validation or split validation in the backend accordingly. One of my concerns regarding small datasets is the 40 percent test share: the model's dynamics can change a lot if that 40 percent of the data is later added back for training at the end (in case we need to model on the whole dataset). If possible, why can't we adopt cross validation or split validation based on the size of the data? Deciding on the size threshold is not an easy task either, but the possibility needs to be tested.
What would this do?
The advantages are that cross validation on a small dataset seems to be more stable, reduces the algorithm's tendency to overestimate performance, and lowers the impact of the model dynamics changing when the model is retrained on the whole dataset. In the case of huge data, as you mentioned earlier, the algorithm converges once it has seen a certain amount of data, so a split can be appropriate there.
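Just to sketch the dispatch idea in scikit-learn terms (the 10,000-row threshold below is an arbitrary assumption of mine, not a tested cutoff):

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

def estimate_performance(model, X, y, small_data_threshold=10_000):
    """Cross-validate small datasets; use a hold-out split for large ones."""
    if len(X) < small_data_threshold:
        # 10-fold cross validation: a stabler estimate on little data
        return cross_val_score(model, X, y, cv=10).mean()
    # 60/40 hold-out split: cheaper and adequate once data is plentiful
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, stratify=y)
    return accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
```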
Coming to the user experience:
We could provide users with an option to check what kind of validation was triggered, so that an experienced user can inspect the more technical side of the model building if they want to.
Your idea of showing two models might increase the run time and, as you mentioned, might confuse most of the novice users.
These are just my thoughts (my 0.02 INR).
Thank you
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
So, having re-focused the issue in the way you have described, I concur that the best solution is probably to present two models and their associated output in Auto Model: one for validation purposes and one for production purposes.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Ingo