RM 9.1 feedback : Auto-model documentation
lionelderkrikor
RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi,
I see that cross validation is now used to evaluate the performance of models in Auto-Model.
I see that the performance associated with the optimized model (calculated via a 3-fold CV on the whole training set, by default 60% of the dataset) is different from the performance delivered by the Performance average (Robust) operator (calculated via a CV on the test set, 40% of the dataset, with 7 - 2 = 5 folds by default). I think this principle of performance evaluation should be explained in the Auto-Model documentation (in the documentation of the "results" screen). Moreover, the current documentation is out of date.
Generally, I think that these elements are important and must be read and understood by the user.
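(To illustrate why those two numbers differ: below is a minimal sketch using scikit-learn as a stand-in for RapidMiner, with the 60/40 split and 3 folds matching the defaults mentioned above. A k-fold CV estimate on the training split and a hold-out estimate on the test split are two different estimates of the same model's quality, so they rarely agree exactly.)

# Minimal sketch (scikit-learn as a stand-in for RapidMiner):
# a 3-fold CV score on the 60% training split and a hold-out score
# on the 40% test split are different estimates and rarely match.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42)

model = GaussianNB()

# Estimate 1: 3-fold cross-validation on the whole training set (60%).
cv_score = cross_val_score(model, X_train, y_train, cv=3).mean()

# Estimate 2: train on the training set, score once on the test set (40%).
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

print(f"3-fold CV on training set: {cv_score:.3f}")
print(f"Hold-out on test set:      {holdout_score:.3f}")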
I have a follow-up question about Auto-Model:
Why does the data sampling differ depending on the model used? For example:
NB ==> max 2,000,000 examples
SVM ==> max 10,000 examples
Thank you for your attention,
Regards,
Lionel
Best Answers
IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

Hi Lionel,

You are right - we will update the documentation accordingly. Sorry for this oversight.

Please note that the outer validation is not a full cross-validation but a multiple hold-out set approach with a robust average calculation (by removing the two outliers). While this estimation is obviously not as good as a full-blown cross-validation, it comes close, plus it has a lower runtime and delivers at least some idea of the deviation of the results.

The different sample sizes are used to ensure an acceptable runtime for the complete AM run. The algorithms all have different algorithmic complexities. Naive Bayes, for example, can be calculated with a single data scan (i.e. linear runtime, which is as fast as it gets). An SVM, on the other hand, has a cubic runtime, which would take ages on millions of data rows.

Best,
Ingo
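(As a rough illustration of the "robust average" Ingo describes - a sketch only, assuming "removing the two outliers" means discarding the two values farthest from the median of the 7 hold-out results; the exact rule used by the operator may differ:)

# Sketch: robust average over 7 hold-out performances - drop the two
# values farthest from the median, then average the remaining 5.
# Assumption: this is what "removing the two outliers" means; the
# actual rule in RapidMiner may differ.
from statistics import mean, median

def robust_average(scores, n_outliers=2):
    med = median(scores)
    # Keep the values closest to the median, discard the rest.
    kept = sorted(scores, key=lambda s: abs(s - med))[:len(scores) - n_outliers]
    return mean(kept)

holdout_scores = [0.81, 0.83, 0.79, 0.82, 0.80, 0.62, 0.95]  # 7 runs
print(robust_average(holdout_scores))  # 0.81, from the 5 central values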
IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

Just FYI: the documentation has been updated and will be delivered with the next release.

Thanks again for pointing this out,
Ingo
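(To make the runtime argument from the first answer concrete: with roughly linear cost, Naive Bayes on 2,000,000 rows does on the order of 2 x 10^6 units of work, while a cubic-cost SVM on the same data would need on the order of (2 x 10^6)^3 = 8 x 10^18 units; capping the SVM at 10,000 rows keeps that near 10^12. The sketch below measures the growth empirically on small sizes - constants and exact exponents vary by implementation, and scikit-learn again stands in for RapidMiner:)

# Sketch: fit-time growth of a linear-time learner (Naive Bayes)
# vs. a kernel SVM as the sample size doubles. Sizes are kept small
# so the script finishes quickly; real constants vary by library.
import time
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

for n in (1000, 2000, 4000, 8000):
    X, y = make_classification(n_samples=n, random_state=0)
    for name, model in (("NB ", GaussianNB()), ("SVM", SVC())):
        start = time.perf_counter()
        model.fit(X, y)
        print(f"{name} n={n:5d}: {time.perf_counter() - start:.3f}s")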
Answers
Thanks for your detailed answer. It's all clear to me now.
Regards,
Lionel