"How to set Weights on Iris Data Set?"
I tried some different algorithms on the Iris sample data set and get around 96% accuracy.
However, Auto Model gets 100%, and I think this comes from the use of weights!?
Unfortunately, I'm not able to reproduce the process from the open process!
Can someone show me how to implement "Weight by Correlation" for polynominal data?
Thx,
Sebastian
Best Answer
IngoRM (RapidMiner Founder)

Hi,

Please note that the outer validation (covering everything from model building to parameter optimization, feature engineering, etc.) is NOT a full k-fold cross-validation. That would be prohibitive in terms of runtime (it would blow up all runtimes by a factor of 5x to 10x, and our research has shown that users are not willing to wait for this).

Instead, in 9.1 we introduced a multiple hold-out set approach plus a robust average calculation (removing the outliers before building the average value). While this is not as perfect as a full-blown cross-validation, it gets close and keeps runtimes at an acceptable level. But you can still get lucky with some of the splits. This is, by the way, also true for cross-validation. Specifically for Iris, though, the problem is that some of the data points with different classes actually overlap, which means that with a full cross-validation you will never reach 100%, while with a random split of 40% or so for the validation set you may end up with a split where this overlap is not a problem.

If you want to learn more about the validation topic, please also check out this white paper here:

We recently updated it a bit to better explain that while cross-validation is great where feasible, the core aspect of correct validation is to validate ALL model optimizations. We use the multiple hold-out set approach described above for this.

Hope this helps,
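The multiple hold-out approach with a robust average can be sketched roughly like this. This is a minimal illustration, not RapidMiner's actual implementation: the number of splits, the hold-out ratio, the seed, and the trimming rule (dropping the single highest and lowest score) are all illustrative assumptions.

```python
import random

def multiple_holdout_scores(examples, labels, train_eval,
                            n_splits=7, holdout=0.4, seed=1992):
    """Evaluate a model on several random hold-out splits.

    train_eval(train_x, train_y, test_x, test_y) -> accuracy is a
    caller-supplied function; n_splits and holdout are illustrative
    values, not RapidMiner's internals.
    """
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - holdout))
        tr, te = idx[:cut], idx[cut:]
        scores.append(train_eval(
            [examples[i] for i in tr], [labels[i] for i in tr],
            [examples[i] for i in te], [labels[i] for i in te]))
    return scores

def robust_average(scores):
    """Average after dropping the single highest and lowest score --
    one simple way to remove outliers before averaging."""
    if len(scores) <= 2:
        return sum(scores) / len(scores)
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)
```

The point of the trimming step is that one unusually lucky (or unlucky) split no longer dominates the reported estimate.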
Ingo
Answers
I have a different hypothesis:
This performance of 100% (accuracy) is due to "luck", from my point of view. Indeed, by default the Auto Model tool performs a split validation with a training/test ratio of 0.8/0.2. So the performance is calculated on 20% of the dataset (30 examples for the Iris dataset); if the sampling is "lucky", all the test examples are correctly classified, which explains this performance.
To convince yourself, you can:
- set another "local random seed" for the sampling of the training/test partition. For example, here are the results with local random seed = 1991:
- decrease the training/test ratio in the Split Data (split of a validation set) operator. With more test examples, it is less likely that all of them are correctly classified by "luck". Here are the results with a train/test ratio of 0.7/0.3 (and local random seed = 1992):
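The "luck" argument above can be quantified with a quick simulation. This is only a sketch: 0.96 stands in for the questioner's roughly 96% accuracy, and 30 and 45 test examples correspond to 20% and 30% splits of the 150-example Iris set.

```python
import random

def p_perfect_test(true_acc, n_test, trials=100_000, seed=1991):
    """Estimate the chance that ALL n_test hold-out examples are
    classified correctly by a model whose true accuracy is true_acc."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(trials):
        # Each test example is independently correct with prob. true_acc.
        if all(rng.random() < true_acc for _ in range(n_test)):
            perfect += 1
    return perfect / trials

# With a 96%-accurate model, a 30-example test set comes out perfect
# about 0.96**30, roughly 29% of the time; a 45-example set only about
# 0.96**45, roughly 16% of the time.
```

So a perfect score on a 30-example split is not surprising at all, and enlarging the test set (or using cross-validation over all 150 examples) makes it much rarer.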
As a beta tester, I was surprised that Auto Model performs a split validation rather than a cross-validation.
A priori, with a cross-validation this kind of "perfect result" would be impossible...
So is there a reason to perform a split validation instead of a cross-validation in this tool (maybe computation time...?).
And to conclude, the moral of this story is that "...in data science (and maybe more generally in life), there are those who are lucky and... the others...."
I hope it helps,
Regards,
Lionel
As far as your other question goes, there is no (sensible) way to use Weight by Correlation for polynominal data. You could either look at another weighting approach (such as Weight by Information Gain) or you would have to transform all your data into binominal 0/1 flags and then calculate numerical correlations. But in neither case will using Weight... operators improve your model performance to 100%!
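The dummy-coding route mentioned above could look like the sketch below. It assumes a numeric 0/1 label (Iris actually has three classes, so you would repeat this per class or pick another weighting scheme); `pearson` and `dummy_correlation_weights` are hypothetical helper names, not RapidMiner operators.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def dummy_correlation_weights(column, label):
    """Turn one polynominal column into 0/1 dummy flags and weight
    each flag by |correlation| with a numeric 0/1 label."""
    weights = {}
    for value in sorted(set(column)):
        flag = [1 if v == value else 0 for v in column]
        weights[value] = abs(pearson(flag, label))
    return weights
```

As the answer says, though, such weights describe the data; they won't by themselves push a learner to 100% accuracy.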
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
I just updated RapidMiner to the 9.1 official release and quickly tested the Auto Model tool:
I want to warmly welcome the introduction of cross-validation inside Auto Model, and I must admit that there is impressive work in this release.
Regards,
Lionel