Auto Model Rows

Madcap · February 2019

Hi, I am currently trying to use Auto Model with a data set which has roughly 1300 rows.
When I load the data I can see 1300 rows, and the Select Task and Prepare Target steps also show 1300 rows. However, when I get the results, choose a model, and go into Predictions, I can only see scores for around 520 rows.

Is there any reason that about half of the rows are missing or not being displayed? I wondered if it was something to do with editing the model types. Currently I am just using the default settings, e.g. Use regularisation, Automatically optimise.

I am currently using an academic license and I checked if it was a row limit but I have unlimited, which makes sense as when I manually make the models I can get results for the 1300 rows.

Thanks for any help you can offer.
-Jason

IngoRM · February 2019

Hi @Madcap,

Glad to hear from you. That behavior is actually what is supposed to happen. We create a 40% hold-out set from your input data to evaluate the model, which in your case is those 520 rows (40% of 1300). Predictions are created only for those rows, to calculate how well the models work. See this discussion for more details: https://community.rapidminer.com/discussion/54774/auto-model-issue

There is really no point in doing this for the 60% of the data the model was trained on by the way. For more on this, I would recommend this white paper here: https://rapidminer.com/resource/correct-model-validation/

Hope this helps,
Ingo
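
To make the arithmetic concrete, here is a minimal Python sketch of a 40% hold-out split on 1300 rows. It is only an illustration, not Auto Model's actual implementation: the function name `holdout_split`, the plain random shuffle, and the seed are assumptions, and Auto Model may well sample differently (for example, stratified).

```python
import random

def holdout_split(n_rows, holdout_fraction=0.4, seed=0):
    """Return (train_indices, holdout_indices) for a random hold-out split.

    Hypothetical helper for illustration; a plain shuffle is assumed.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_rows * holdout_fraction)
    # The first n_holdout shuffled rows are held out; the rest are used for training.
    return indices[n_holdout:], indices[:n_holdout]

train_idx, holdout_idx = holdout_split(1300)
print(len(train_idx), len(holdout_idx))  # 780 520
```

Only the 520 held-out rows receive predictions, which matches the row count seen in the Predictions view.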

BalazsBarany · February 2019

Maybe AutoModel should switch to cross-validation on smaller datasets.

The cross-validation is more accurate in this case. You get a higher number from AutoModel but that doesn't mean that the model is better, it just means that it got lucky when tested on less data.

varunm1 · February 2019

Hi @Madcap

You can use the cross-validation results, as your data set is small. Auto Model might show a higher accuracy because it's not training and testing on the whole data set.

Thanks
Varun

Telcontar120 · February 2019

I would argue that in all cases cross validation is a better performance indicator (in line with the whitepaper Ingo references above). Any split validation sample is always going to be subject to the idiosyncrasies of only a subset of the data and how it is different from the overall sample. It is true that in larger datasets this should diminish in magnitude, but cross-validation eliminates it entirely.
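
As a rough illustration of why cross-validation covers every row, here is a small Python sketch of k-fold index generation. The helper `k_fold_indices` and the choice of 10 folds are assumptions made for this example; it is not how RapidMiner's Cross Validation operator is implemented, and it only builds the index sets without training any model.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train, test) index lists; each row lands in exactly one test fold.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    for fold in range(k):
        # Every k-th row, starting at offset `fold`, forms the test fold.
        test = indices[fold::k]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

scored = []
for train, test in k_fold_indices(1300, k=10):
    scored.extend(test)  # rows that would be scored in this fold

print(len(scored), len(set(scored)))  # 1300 1300
```

Each of the 1300 rows is scored exactly once, by a model that did not see it during training, whereas a single 40% hold-out scores only 520 of them.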

Madcap · February 2019

Thanks that makes sense.

Just one final thing, if that is okay, which results would I be inclined to use then? The manual decision tree (with cross validation) which takes into account all the rows or the auto model which takes 40%? The numbers are very similar maybe only 1%-2% difference, with auto model having higher accuracy.

Thanks again
-Jason

Madcap · February 2019

Thanks for your help guys.
I will take the cross-validation reading then. I am actually looking into RapidMiner for my honours project (dissertation), so all of this advice is really helpful and gives me more to write about!

Thanks
-Jason

Telcontar120 · February 2019

Yes, consistent with my comments above, I would report the performance results from the cross-validation.
