Other ways to Validate results
Hello,
I have a data set of 84 rows and 400 attributes, and it is a classification problem. I prepared the data so that I can run a decision tree or other tree models. To evaluate and test the model I use the Performance operator, especially the accuracy. I split the data in a ratio of 80/20: 80% is the training set and 20% the test set.
The resulting model has an accuracy of 80%. When I change the split type, for example from stratified to shuffled, or the ratio from 80/20 to 70/30, the accuracy drops to 60%. Now my questions:
Is this phenomenon normal? Is there any other way to validate a classification model? And probably a question that can only be answered by seeing the process: why does the model accuracy vary so drastically just from the split ratio or split type?
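For illustration, here is roughly what my process corresponds to, sketched in Python with scikit-learn. My actual process is built with RapidMiner operators, so this is only a sketch, and the file name and "label" column are made up:

# Rough Python/scikit-learn equivalent of the RapidMiner process, for illustration only.
# The file name and the "label" column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("my_data.csv")                     # hypothetical: 84 rows, 400 attributes
X, y = data.drop(columns=["label"]), data["label"]

# 80/20 split; stratify=y mimics the "stratified sampling" split type,
# stratify=None would correspond to a plain shuffled split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))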
Thanks a lot!
Best Answers
varunm1 Member Posts: 1,207 Unicorn
Hello @dome
Yes, it is possible. The accuracy depends on the test data, and if the test data changes, the accuracy changes. This is why we recommend using the Cross Validation operator: it splits the data into multiple folds (N), trains on N-1 folds, and tests on the left-over fold, and this repeats until all of the data has been used for training and testing, giving you a reliable performance estimate. As your data set is small, I recommend you use either 3 or 5 folds in CV.
Here is a detailed thread on the working of cross-validation.
https://community.rapidminer.com/discussion/55112/cross-validation-and-its-outputs-in-rm-studio#latest
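If it helps to see the idea outside of Studio, here is a minimal sketch of k-fold cross-validation in Python with scikit-learn. It is only meant to illustrate what the operator does internally; the file name and "label" column are assumptions:

# Minimal sketch of 5-fold cross-validation with scikit-learn, illustrating
# what the RapidMiner Cross Validation operator does internally.
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("my_data.csv")          # hypothetical file: 84 rows, 400 attributes
X = data.drop(columns=["label"])           # "label" is the assumed class column
y = data["label"]

# 5 folds: train on 4 folds, test on the remaining fold, repeat 5 times
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

The mean of the per-fold accuracies is the number to report, and the spread across folds shows how sensitive the model is to the particular split, which is exactly the effect you observed.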
Hope this helps. Please inform if you need more info.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
varunm1 Member Posts: 1,207 Unicorn
Hello @dome
Here are the reasons why I use stratified or shuffled sampling.
Stratified: I use this when my classes are highly imbalanced and I want the same proportion of classes in all my folds. For example, say I have a data set of 100 examples, with 80 of them belonging to Class A and 20 belonging to Class B. If I use stratified sampling with 5 folds, then each fold will have 16 Class A and 4 Class B examples.
Shuffled sampling: This will randomly shuffle your examples and divide them into folds of 20 each; there won't be any class balancing in the folds.
Now, why stratified and not shuffled?
Sometimes, in the case of shuffled sampling, a fold can end up with examples of mostly or only one class; to avoid this, we use stratified sampling.
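Here is a small sketch in Python with scikit-learn that shows the difference concretely (again, outside of RapidMiner, using the toy 80/20 class counts from the example above):

# Compare shuffled vs. stratified 5-fold splitting on an imbalanced label.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy data: 100 examples, 80 of Class A and 20 of Class B
y = np.array(["A"] * 80 + ["B"] * 20)
X = np.arange(100).reshape(-1, 1)

for name, splitter in [("shuffled", KFold(n_splits=5, shuffle=True, random_state=1)),
                       ("stratified", StratifiedKFold(n_splits=5, shuffle=True, random_state=1))]:
    print(name)
    for _, test_idx in splitter.split(X, y):
        counts = dict(zip(*np.unique(y[test_idx], return_counts=True)))
        print("  fold test set:", counts)   # stratified folds always show {'A': 16, 'B': 4}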
Hope this helps.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing