Running a model
Hello and happy holidays to everyone.
Let me explain my situation.
I'm running some models that take almost 9 minutes each when I cross-validate them. There are about 30 models in total. I need to save time, so I removed the cross-validation, but when I apply the same model to the same control group the results are different. This was expected; the results lose quality, but I save 80% of the time.
The question is: Is it possible to take the model that the cross-validation returns and apply it to the control group without having to run the whole process over and over again?
I do not know if I explained myself well.
Regards.
Best Answer
yani_ca031
I have left a control group out of my data. I then apply the model to this control group, and that is when I see that difference between them. What I did was save the model that the cross-validation operator returns and then apply only this to my new data; the result is the desired one, and at an enviable speed. This program is great.
Answers
Hi!
I think you are misunderstanding the purpose of cross-validation.
You can use cross-validation to check the quality of your model building process (which can include feature selection, normalization, PCA and other transformations) on data that was not used in building the model. This helps you estimate the future performance of your process on future data.
The cross-validation doesn't improve the model itself. The model coming out of the cross-validation is simply the one built on all the data in the dataset.
When you're happy with the performance of your modeling process, you can take out the cross-validation and build your production models faster. (You should do a cross-validation from time to time to monitor the performance of your modeling process.)
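To make the distinction concrete, here is a minimal sketch in Python/scikit-learn (an analogy only, not how RapidMiner works internally; the dataset and tree parameters are placeholders I chose for illustration): the cross-validation only yields a performance estimate, while the model you actually keep is simply trained on all the data.

```python
# Illustrative sketch: cross-validation estimates how the modeling process
# generalizes; the deployable model is just refit on all the training data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5, random_state=42)

# Performance estimate: 10-fold cross-validation on the training data.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("Estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# The model you deploy is trained once on all the data;
# cross-validation does not change or improve it.
final_model = model.fit(X, y)
```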
Regards,
Balázs
Hi again and thanks for your help.
That's what I thought at first, but the reality is different. The results differ a lot when applied separately, in my case at least. I have extracted the operator from the validation process and run it again, and the prediction results on the same control group differ greatly, as if it were not the same model.
RapidMiner allows me to save the process, but not the model?
Is it possible that, when using MetaCost, I am changing the position of the labels in the cost matrix and that this modifies the prediction? If so, how would I know which position each label occupies in the matrix so I can assign the costs?
You can save a model in RapidMiner. Just use the Store operator and connect it to the mod output port of the Cross Validation operator, or of whatever algorithm you are using.
That said, switching up labels or changing any preprocessing can have an effect on your performance. You are no longer comparing 'apples to apples.'
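As a rough analogy outside RapidMiner, storing a trained model looks like this in Python (a hedged sketch using joblib; the dataset, parameters, and file name are made up for illustration):

```python
# Persist the trained model once so it can be reused later without retraining,
# analogous to wiring the mod port into a Store operator.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

joblib.dump(model, "trained_model.joblib")  # hypothetical file name
```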
Thanks Thomas, would this solve my initial problem?
What would be the solution to the problem of having to run the entire process every time a prediction is made, given that when the learner is extracted from the cross-validation it returns a different prediction?
And why does it return a different prediction? It is as if the model that is validated and the one that is applied without validation were not the same model.
I don't know how you have your process set up, but if you use the same data, same seed, and same validation process, your test results should be the same and repeatable. From what you are describing, something doesn't sound right.
Edit: In Cross Val, the results will give a +/- to your measure of performance. That indicates the range of how your model may perform in reality. So if you get an accuracy of 60% +/- 15%, then your model could be as bad as 45% or as good as 75% in production. The lower that deviation, the more stable your model is. I would check your performance results and evaluate that. If the swing is too high, then I would look at rebuilding your model.
Normally, you train and store the model in a separate process and then have a second scoring process for production. Retraining a model for every prediction is just super time consuming. Scoring is fast.
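A minimal sketch of that scoring side in Python (illustrative only; it assumes a model file like the one stored in the earlier sketch, and the file name is hypothetical):

```python
# Scoring process: load the stored model and predict -- no retraining, so it is fast.
import joblib
from sklearn.datasets import load_breast_cancer

X_new, _ = load_breast_cancer(return_X_y=True)   # stand-in for fresh data to score

model = joblib.load("trained_model.joblib")      # model stored by the training process
predictions = model.predict(X_new)
print(predictions[:10])
```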
Now, when I save the model that the cross-validation returned and apply it to the control group, I get the same result. The question then changes: why do I not get the same result when I use the same learner with the same parameters outside the cross-validation? The learner is a MetaCost with a decision tree.
Thanks Thomas. I do not change anything in the process. I only extract the learner from inside the cross-validation operator, cut it and paste it outside, run the test, and what I described happens. It is strange. This should not happen, I think.
Now it's fast
Hi again,
how are you checking the quality of the model the second time? (What are you comparing to the performance output from the cross validation?)
If you apply the model on its own input data, then those results are simply irrelevant and shouldn't be used. You could use k-NN with k=1 in that scenario and get a 100 % correct model according to this wrong method of validation.
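Here is a quick scikit-learn illustration of that point (the dataset is an arbitrary example): a 1-NN model memorizes its training set, so evaluating it on that same data reports perfect accuracy while saying nothing about new data.

```python
# Evaluating a model on its own training data is meaningless:
# 1-NN's nearest neighbor of each training point is the point itself.
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.score(X, y))  # 1.0 on the training data, no indication of real performance
```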
MetaCost uses the same order for the attribute values as you see in the confusion matrix. You can use the Remap Binominals operator to create a fixed "order" of your binominal attribute values.
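A small sketch, in plain Python rather than RapidMiner, of why the label order behind a cost matrix matters (the labels, probabilities, and costs here are invented for illustration):

```python
# cost[i][j] = cost of predicting class j when the true class is i,
# with rows/columns in a fixed label order (here: ["no", "yes"]).
import numpy as np

labels = ["no", "yes"]
# Misclassifying a true "yes" as "no" is five times worse than the reverse.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

probs = np.array([0.7, 0.3])          # model's estimated P(no), P(yes)
expected_cost = probs @ cost          # expected cost of predicting each class
print(labels[int(np.argmin(expected_cost))])  # "yes": the high cost on a missed
                                              # "yes" flips the decision

# If the same matrix were read with the labels in the opposite order,
# the costly cell would apply to the wrong error and the prediction would
# change -- which is why fixing the mapping (e.g. with Remap Binominals)
# matters.
```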
Regards,
Balázs