"How to set Weights on Iris Data Set?"
I tried some different algorithms on the Iris sample data set and get around 96% accuracy.
However, Auto Model gets 100%, and I think this comes from the use of weights!?
Unfortunately, I'm not able to reproduce the process from the open process!
Can someone show me how to implement "Weight by Correlation" for polynominal data?
Thx,
Sebastian
Best Answer
IngoRM (RapidMiner Founder)

Hi,

Please note that the outer validation (covering everything from model building to parameter optimization, feature engineering, etc.) is NOT a full k-fold cross-validation. That would be prohibitive in terms of runtime (it would blow up all runtimes by a factor of 5x to 10x, and our research has shown that users are not willing to wait for this).

Instead, in 9.1 we introduced a multiple hold-out set approach plus a robust average calculation (removing the outliers before building the average value). While this is not as perfect as a full-blown cross-validation, it gets close and keeps runtimes at an acceptable level. But you can still get lucky with some of the splits. This is, by the way, also true for cross-validation. Specifically for Iris, though, the problem is that some of the data points with different classes actually overlap, which means that with a full cross-validation you will never reach 100%, while with a random split of 40% or so for the validation set you may end up with a split where this overlap is not a problem.

If you want to learn more about the validation topic, please also check out this white paper here:

We recently updated it a bit to better explain that while cross-validation is great where feasible, the core aspect of correct validation is to validate ALL model optimizations. We use the multiple hold-out set approach described above for this.

Hope this helps,
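The multiple hold-out approach with a robust average can be sketched roughly like this. This is a minimal illustration, not RapidMiner's actual implementation: the number of splits, the hold-out ratio, the seed, and the trimming rule (dropping the single highest and lowest score) are all illustrative assumptions.

```python
import random

def multiple_holdout_scores(examples, labels, train_eval,
                            n_splits=7, holdout=0.4, seed=1992):
    """Evaluate a model on several random hold-out splits.

    train_eval(train_x, train_y, test_x, test_y) -> accuracy is a
    caller-supplied function; n_splits and holdout are illustrative
    values, not RapidMiner's internals.
    """
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - holdout))
        tr, te = idx[:cut], idx[cut:]
        scores.append(train_eval(
            [examples[i] for i in tr], [labels[i] for i in tr],
            [examples[i] for i in te], [labels[i] for i in te]))
    return scores

def robust_average(scores):
    """Average after dropping the single highest and lowest score --
    one simple way to remove outliers before averaging."""
    if len(scores) <= 2:
        return sum(scores) / len(scores)
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)
```

The point of the trimming step is that one unusually lucky (or unlucky) split no longer dominates the reported estimate.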
Ingo
Answers
I have a different hypothesis:
This performance of 100% (accuracy) is due to "luck", from my point of view. Indeed, by default the Auto Model tool performs a split validation with a training/test ratio of 0.8/0.2. So the performance is calculated on 20% of the dataset (30 examples for the Iris dataset); if the sampling is "lucky", all the test examples are correctly classified, which explains this performance.
To convince yourself, you can:
- set another "local random seed" for the sampling of the training/test partition. For example, here are the results with local random seed = 1991:
- decrease the training/test ratio in the Split Data (split of a validation set) operator. With more test examples, it is less likely that all of them are correctly classified by "luck". Here are the results with a train/test ratio of 0.7/0.3 (and local random seed = 1992):
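The "luck" argument above can be quantified with a quick simulation. This is only a sketch: 0.96 stands in for the questioner's roughly 96% accuracy, and 30 and 45 test examples correspond to 20% and 30% splits of the 150-example Iris set.

```python
import random

def p_perfect_test(true_acc, n_test, trials=100_000, seed=1991):
    """Estimate the chance that ALL n_test hold-out examples are
    classified correctly by a model whose true accuracy is true_acc."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(trials):
        # Each test example is independently correct with prob. true_acc.
        if all(rng.random() < true_acc for _ in range(n_test)):
            perfect += 1
    return perfect / trials

# With a 96%-accurate model, a 30-example test set comes out perfect
# about 0.96**30, roughly 29% of the time; a 45-example set only about
# 0.96**45, roughly 16% of the time.
```

So a perfect score on a 30-example split is not surprising at all, and enlarging the test set (or using cross-validation over all 150 examples) makes it much rarer.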
As a beta tester, I was surprised that Auto Model performs a split validation rather than a cross-validation.
A priori, with a cross-validation this kind of "perfect result" would be impossible...
So is there a reason to perform a split validation instead of a cross-validation in this tool (maybe computation time...?).
And to conclude, the moral of this story is that "...in data science (and maybe more generally in life), there are those who are lucky and... the others...."
I hope it helps,
Regards,
Lionel
As far as your other question goes, there is no (sensible) way to use Weight by Correlation for polynominal data. You could either look at another weighting approach (such as Weight by Information Gain) or you would have to transform all your data into binominal 0/1 flags and then calculate numerical correlations. But in neither case will using Weight... operators improve your model performance to 100%!
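The dummy-coding route mentioned above could look like the sketch below. It assumes a numeric 0/1 label (Iris actually has three classes, so you would repeat this per class or pick another weighting scheme); `pearson` and `dummy_correlation_weights` are hypothetical helper names, not RapidMiner operators.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def dummy_correlation_weights(column, label):
    """Turn one polynominal column into 0/1 dummy flags and weight
    each flag by |correlation| with a numeric 0/1 label."""
    weights = {}
    for value in sorted(set(column)):
        flag = [1 if v == value else 0 for v in column]
        weights[value] = abs(pearson(flag, label))
    return weights
```

As the answer says, though, such weights describe the data; they won't by themselves push a learner to 100% accuracy.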
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
I just updated RapidMiner to the 9.1 official release and quickly tested the Auto Model tool:
I want to warmly welcome the introduction of cross-validation inside Auto Model, and I must admit that there is impressive work in this release.
Regards,
Lionel