How to get the best out of n models?
I've been training models on text data for classification and have 3 different models so far (SVM, LDA, and Naive Bayes). While all 3 of them give me roughly the same results on average, there are noticeable differences in the areas where a model 'doubts' which label to predict.
So I'd like to combine the actual output of all 3 of them (or even more in the future) to come up with a kind of 'best out of 3' solution:
If all of my models predict label_x for a given record, that is the obvious winner.
If 2 out of 3 predict label_x, that should be the final label.
If all 3 predict a different label, the record needs more attention / should be skipped.
Are there operators that can do this? For now I have a relatively complex setup that does this for me, but if something more structured and out of the box is available, that would be handy.
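In code terms (just to illustrate the logic I'm after, not my actual setup), the three rules above boil down to a simple majority vote; a minimal Python sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model labels into a final label.

    predictions: one predicted label per model,
    e.g. ['label_x', 'label_x', 'label_y'].
    Returns the label backed by at least 2 of 3 models, or None
    when all models disagree (record needs attention / is skipped).
    """
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= 2 else None

print(majority_vote(['label_x', 'label_x', 'label_y']))  # label_x
print(majority_vote(['label_a', 'label_b', 'label_c']))  # None
```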
Best Answer
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert)
Yes, I see the difference. I don't think there is anything out of the box to do what you want with already-trained models, although it should be easy enough to replicate those original models with the same parameters inside the ensemble operators, assuming you built them in RapidMiner originally.
If you are committed to using your earlier models, then a manual approach with Apply Model and Generate Attributes (and maybe some looping) is probably your best bet; it sounds like you might already be doing that.
Answers
Of course, this is a textbook case for ensemble modeling. There are multiple operators available in RapidMiner to handle this type of situation. Based on your description, your two main options here are voting and stacking.
Voting simply applies all the separate models independently and makes the final determination by majority vote (so it is helpful to have an odd number of models). This is not a weighted vote, though, so the information in the individual confidence values is lost. You could separately use Generate Attributes to create a weighted vote if you want.
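To illustrate the weighted-vote idea, here is a minimal Python sketch with made-up weights and confidences (in RapidMiner you would express the same arithmetic with Generate Attributes):

```python
def weighted_vote(confidences, weights):
    """confidences: model name -> {label: confidence} for one record.
    weights: model name -> trust weight for that model."""
    scores = {}
    for model, label_conf in confidences.items():
        for label, conf in label_conf.items():
            scores[label] = scores.get(label, 0.0) + weights[model] * conf
    return max(scores, key=scores.get)

# Made-up confidences for a single record.
confidences = {
    'svm':   {'label_x': 1.0, 'label_y': 0.0},  # SVM gives hard 0/1 scores
    'lda':   {'label_x': 0.6, 'label_y': 0.4},
    'bayes': {'label_x': 0.3, 'label_y': 0.7},
}
weights = {'svm': 0.5, 'lda': 1.0, 'bayes': 1.0}  # down-weight the hard SVM scores
print(weighted_vote(confidences, weights))  # label_x (1.4 vs 1.1)
```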
The other option is Stacking, which trains one top-level ML algorithm on the outputs of the individual models to decide how to combine them for each example. This is often done with a decision tree learner, although more complex schemes are also feasible.
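If it helps to see the idea in code, here is a minimal stacking sketch in scikit-learn (my assumption for illustration; in RapidMiner the Stacking operator plays the same role):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # toy stand-in data

# Three base learners mirroring the SVM / LDA / Bayes setup, with a
# decision tree as the top-level learner that combines their outputs.
stack = StackingClassifier(
    estimators=[('svm', SVC(probability=True)),
                ('lda', LinearDiscriminantAnalysis()),
                ('bayes', GaussianNB())],
    final_estimator=DecisionTreeClassifier(max_depth=3),
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```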
Both of these are available in the base operators for RapidMiner and have tutorials if you want to see the setup. I'd encourage you to try them both out and see which one works better for your use case.
And to deal with this part:
"If all 3 predict a different label it needs more attention / be skipped"
You could use the operator "Drop Uncertain Predictions". It replaces the prediction with "?" whenever the winning confidence falls below a user-supplied threshold.
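The behavior, sketched in Python (the 0.5 default here is just a placeholder; in the operator the threshold is a parameter you set):

```python
def drop_uncertain(prediction, confidence, threshold=0.5):
    # Mimics the idea behind "Drop Uncertain Predictions": keep the label
    # only when the winning confidence reaches the threshold, else "?".
    return prediction if confidence >= threshold else '?'

print(drop_uncertain('label_x', 0.8))  # label_x
print(drop_uncertain('label_y', 0.3))  # ?
```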
Thanks Brian,
I looked at these earlier, but unless I am missing something, the out-of-the-box solutions (both Vote and Stacking) are meant to be used during the training process, while I already have 3 dedicated, trained models available. So in the end, using these, I would have one model built from some mix based on my training data. It remains an option of course, but at this stage I'd prefer to keep my separate models, as it is a bit easier to tune the preparation per model.
So is there a way to do something similar, but with already-pretrained models? My unlabeled data would be scored by the 3 saved models, and the label provided most often would win. I have built a process doing exactly this, but if there are operators doing it in a more streamlined way, that would be nice.
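Roughly, my current process does the following, sketched here in Python for clarity (the file names and joblib persistence are placeholders of mine; in RapidMiner it is Retrieve plus one Apply Model per saved model):

```python
from collections import Counter

import joblib  # placeholder: assumes the models were saved with joblib.dump

# Load the three pre-trained models once (hypothetical file names).
models = {name: joblib.load(f'{name}.pkl') for name in ('svm', 'lda', 'bayes')}

def ensemble_predict(X_unlabeled):
    # One list of predictions per model, then a row-wise majority vote.
    per_model = [m.predict(X_unlabeled) for m in models.values()]
    finals = []
    for row in zip(*per_model):
        label, count = Counter(row).most_common(1)[0]
        finals.append(label if count >= 2 else None)  # None -> needs attention
    return finals
```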
True, but it doesn't work for all models; for instance, SVM will provide either 1 or 0 as the confidence for a given predicted label. And as I do not want to lose records in the process (since I now join the 3 different model predictions into one set), I would have to assign a new label (like 'undefined' or so) for these. It's something I could add at a later stage, but first I'd like to understand if there are operators making it easier, so I can simplify my current logic.
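The fallback I have in mind, sketched on the joined prediction set (pandas and the column names are placeholders of mine, not my actual process):

```python
import pandas as pd

# Joined output of the three models, one prediction column per model.
df = pd.DataFrame({
    'pred_svm':   ['label_x', 'label_a'],
    'pred_lda':   ['label_x', 'label_b'],
    'pred_bayes': ['label_y', 'label_c'],
})

def vote_or_undefined(row):
    top = row.mode()
    # A unique mode means at least 2 of 3 agree; otherwise keep the record
    # but mark it 'undefined' for later review instead of dropping it.
    return top.iloc[0] if len(top) == 1 else 'undefined'

df['final'] = df.apply(vote_or_undefined, axis=1)
print(df)
```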