"Weka - Random Forest"
dragonedison
Member Posts: 17 Contributor II
Dear everyone,
I have a training set and a test set, each with 130 attributes. I apply Weka Random Forest to train on the training set with all the attributes. The program selects 8 attributes and achieves 100% accuracy on the training set, but its performance on the test set is rather poor: only 53.7% accuracy.
Then I train on the training set with only one attribute at a time and apply each of the resulting 130 classifiers to the test set. I find that some of these classifiers produce 80% accuracy on the test set, although they are not the best of the 130 classifiers on the training set.
What I want to know is: how can I train an even better classifier for the test set using those attributes that produce 80% accuracy (of course, I can't use the test set to train the classifier)? Should I simply choose the good attributes and put them into the Random Forest training, or is there a better way to do this?
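The screening procedure described above can be sketched roughly as follows. This is a hypothetical illustration using scikit-learn rather than Weka; the synthetic data, the number of kept attributes, and the hold-out split are all assumptions. The key point is to rank attributes on a validation split carved out of the training data, never on the test set.

```python
# Sketch of per-attribute screening, then retraining a forest on the
# best-scoring attributes (scikit-learn stands in for Weka here).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 130))          # 130 attributes, as in the post
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only 2 attributes are informative

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Train one single-attribute classifier per column; score on the
# validation split (NOT the test set) to rank the attributes.
scores = []
for j in range(X.shape[1]):
    stump = DecisionTreeClassifier(max_depth=2, random_state=0)
    stump.fit(X_tr[:, [j]], y_tr)
    scores.append(stump.score(X_val[:, [j]], y_val))

# Keep the best-scoring attributes and retrain a forest on them only.
top = np.argsort(scores)[-8:]
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr[:, top], y_tr)
print("selected attributes:", sorted(top.tolist()))
print("validation accuracy: %.3f" % forest.score(X_val[:, top], y_val))
```

A cleaner variant of the same idea is cross-validated feature selection inside the training data, so the ranking itself doesn't overfit a single split.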
Thanks.
Gary
Answers
Could you please post the workflow to clarify the process you are working with? For example, I am missing the number of trees you are using and the maximal tree depth. These strongly influence the accuracy.
Is there a reason why you use the Weka Random Forest instead of the (normal RapidMiner) Random Forest? I use the second one and am totally satisfied with it.
Cheers,
Markus
I grow 100 trees because I read in some articles that this number best trades off accuracy against computation time; the depth of the trees is unlimited.
The process is as shown in the image.
The reason I chose Weka RF is that the RF provided by RapidMiner produces a memory error on my dataset, so I have to use the Weka one.
Regards,
Gary
In my mind the unlimited tree depth is also the reason for the 100% accuracy with 8 attributes.
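Markus's point can be illustrated with a minimal sketch (scikit-learn rather than Weka, and random labels chosen deliberately; both are assumptions for illustration): a tree grown to unlimited depth keeps splitting until every training example sits in a pure leaf, so it memorizes the training set even when there is nothing to learn.

```python
# A fully grown tree reaches 100% training accuracy even on random
# labels, because every training point ends up in its own pure leaf.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)   # random labels: nothing to learn

deep = DecisionTreeClassifier(max_depth=None).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3).fit(X, y)

print("unlimited depth, train accuracy:", deep.score(X, y))    # 1.0
print("depth 3, train accuracy: %.2f" % shallow.score(X, y))   # well below 1.0
```

The 100% training accuracy therefore says nothing about generalization; the test-set accuracy is the number that matters.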
BTW: the images are not working on my PC; could you try sending the process as XML?
Best,
Markus
Please refer to these links for the images.
http://img307.ph.126.net/r_HvVZDAI1Xg_dMgHYNuNQ==/4786763453941444218.jpg
http://img307.ph.126.net/NWAYQdbrX1kwscyTNjAKBg==/4786763453941444209.jpg
I did not use unlimited tree depth for the RapidMiner random forest. I used depth 20, but I generated more than 100 trees for 10,000 data points, about 500 trees.
I would like to know why unlimited tree depth generates 100% accuracy, and what depth I should use instead.
Regards,
Gary
There is another problem I am concerned about with the Weka Random Forest. When I use the Performance operator to obtain the classification accuracy of the model (see the post above for the process image), I get two kinds of classification results, namely "Multiclass Classification Performance" and "Binary Classification Performance". The binary classification performance is rather poor. I would like to know what the difference between these two performances is. Both my training and test data are two-class data.
Thanks.
Gary
Unfortunately the links are not working either, so I still have to assume a little.
First of all, Binary Classification Performance is for binominal attributes (which means there are only two values). Multiclass Performance is for classification with more than two classes. Maybe there is something wrong with your label, because if you have only two classes, those two values should be the same.
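The accuracy values should indeed agree on two-class data. What often differs in a binary view are the per-class measures, such as recall for the chosen positive class. A pure-Python illustration (the numbers are made up, not taken from the thread): with imbalanced classes, a model that mostly predicts the majority class scores well on accuracy yet poorly on positive-class recall.

```python
# Overall accuracy vs. recall for the positive class on the
# same predictions: they can tell very different stories.
truth = ["neg"] * 90 + ["pos"] * 10
preds = ["neg"] * 90 + ["neg"] * 8 + ["pos"] * 2  # misses most positives

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
tp = sum(t == p == "pos" for t, p in zip(truth, preds))
recall_pos = tp / truth.count("pos")

print("accuracy: %.2f" % accuracy)        # 0.92
print("recall (pos): %.2f" % recall_pos)  # 0.20
```

So if the binary view looks "rather poor", it is worth checking which class is marked positive and whether the classes are balanced.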
OK, for such a big dataset I would recommend reducing the number of trees. My PC got stuck at about 1200 trees on 250 examples. Try starting with something like 100 and increase the number of trees stepwise towards 1000 if the performance is poor. That way you can also see how the performance changes and where the memory limit is.
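The stepwise procedure above can be sketched like this (scikit-learn stands in for RapidMiner, and the synthetic data and tree counts are assumptions): grow the forest in stages and watch how held-out accuracy changes with the number of trees.

```python
# Increase the tree count stepwise and track held-out accuracy,
# as suggested above (illustrative sketch with scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n_trees in (100, 250, 500, 1000):
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=20,
                                random_state=0, n_jobs=-1)
    rf.fit(X_tr, y_tr)
    print("%4d trees -> test accuracy %.3f" % (n_trees, rf.score(X_te, y_te)))
```

Typically the accuracy curve flattens well before memory becomes a problem, which tells you where to stop adding trees.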
Hope this helps!
Best,
Markus
Thank you for giving so much important advice. I have decided to paste the XML here for your reference.
Regards,
Gary