Test set beating training set
alan_jeffares
Member Posts: 20 Contributor I
Hello,
I have begun using RapidMiner recently and am having a strange problem with one of my workflows. I split a dataset using the Split Data operator, built a random forest on the 90% training set, and applied that model to the 10% test set. However, when I assess the performance, the test set consistently does better, even as I vary the seeds. This result seems counterintuitive, and I'm wondering whether I have misinterpreted one of the parameters or am missing a detail.
By the way, I am aware that there are more efficient ways to set up this flow; I am trying alternative ways as a bit of practice.
Thanks
Answers
Ok, you have to be careful with your setup, because the results can be misleading depending on the partition sizes you chose for the Split Data operator. Why 90% and 10%? Why not 85% and 15%? Your results will vary with the size of the split, the seed, and how you split the data. I noticed you used stratified sampling, which samples your data according to the class distribution of survivorship (Yes/No), so you can get strange results there.
What I suggest is to use a Cross Validation operator, as your setup appears to be trying to mimic that process. I ran the process below with a different seed and found that the training performance is slightly better than the test performance. Then I added a Cross Validation operator and measured the results there.
Also, you can use the Select Attributes operator to select the attributes you want in lieu of the R script.
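To make the point about split size and seed concrete outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not the process from this thread: the data is synthetic, the model is a one-feature threshold classifier rather than a random forest, the split is a plain shuffle rather than stratified, and all function names (`make_data`, `fit_stump`, `holdout_gap`, `cross_val_accuracy`) are made up for illustration. It counts how often a single seeded 90/10 split gives a test score above the training score, then reports a 10-fold cross-validation estimate for comparison.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic binary data: one feature, with 20% of labels flipped as noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:  # label noise, so no model can be perfect
            y = 1 - y
        rows.append((x, y))
    return rows

def accuracy(threshold, rows):
    """Fraction of rows where the rule 'x > threshold' matches the label."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)

def fit_stump(train):
    """Pick the candidate threshold with the highest training accuracy."""
    candidates = [i / 20 for i in range(1, 20)]
    return max(candidates, key=lambda t: accuracy(t, train))

def holdout_gap(data, split_seed, test_fraction=0.1):
    """Training accuracy minus test accuracy for one seeded hold-out split."""
    shuffled = data[:]
    random.Random(split_seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    t = fit_stump(train)
    return accuracy(t, train) - accuracy(t, test)

def cross_val_accuracy(data, k=10, seed=0):
    """Mean and standard deviation of test accuracy over k folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit_stump(train), folds[i]))
    return statistics.mean(scores), statistics.stdev(scores)

data = make_data(500, random.Random(42))
gaps = [holdout_gap(data, seed) for seed in range(20)]
negative = sum(g < 0 for g in gaps)  # splits where the test set "won"
print(f"test beat train in {negative} of {len(gaps)} splits")
mean_acc, sd_acc = cross_val_accuracy(data)
print(f"10-fold CV accuracy: {mean_acc:.3f} +/- {sd_acc:.3f}")
```

With only 10% of the rows held out, each test score rests on few examples, so the gap can land on either side of zero from one seed to the next. Averaging over folds is one way to smooth that out.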
Thanks for the response.
Yes, I am aware that some of my parameters were a bit weird, but I was varying them all and getting similar results. It turns out I was changing the seed in the wrong operator. Silly mistake.
Regarding the choice of operators, I was using things such as Execute R just to try out different operators and get a feel for how everything works. Thanks for the help.
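For anyone who hits the same seed mix-up, a small standard-library Python sketch of the idea follows. It is an illustration under assumptions, not the RapidMiner process: `split_rows` is a hypothetical helper, and the "other" seed is assumed to belong to some later step, such as the learner, which the thread does not specify. The partition only changes when the seed that drives the split changes.

```python
import random

def split_rows(rows, split_seed, test_fraction=0.1):
    """Shuffle with the split's own seed and return (train, test)."""
    shuffled = rows[:]
    random.Random(split_seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))

# Varying a seed that belongs to another step (here a hypothetical
# model_seed) while the split seed stays fixed: same test rows every time.
partitions_fixed = {
    tuple(split_rows(rows, split_seed=1)[1]) for model_seed in range(5)
}

# Varying the split seed itself: the test rows actually change.
partitions_varied = {
    tuple(split_rows(rows, split_seed=s)[1]) for s in range(5)
}

print("distinct test sets, split seed fixed: ", len(partitions_fixed))
print("distinct test sets, split seed varied:", len(partitions_varied))
```

If the test partition never changes, an unusually easy (or hard) set of held-out rows will keep producing the same pattern no matter how many other seeds are varied.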