Poor recall and precision classification results
Hello RapidMiner community!
As a newbie to the machine learning and data mining world, I'd first like to extend my thanks to the RapidMiner team for working so hard on the tutorials to make the topic as accessible as possible. Your software is a joy to use. Now onto my problem.
I'm performing tool testing as part of a student assignment where I have to compare RapidMiner and Weka, both in experimental results and in general. I'm currently having some problems with the experimental part of the assignment. My task is to compare three RapidMiner implementations of classification algorithms with three of Weka's. In my case this means Decision Tree vs. J48, k-NN vs. IBk, and the respective implementations of Naive Bayes. Parameters are at their defaults, except that I have disabled Laplace smoothing for Naive Bayes. I've used 10-fold cross-validation with the Performance (Polynominal) operator.
RapidMiner's accuracy is fine and compares well to Weka's implementations; in fact, Decision Tree does better in most cases. The recall and precision are somewhat troublesome, though. Consider the following tables:
Precision: https://gyazo.com/ced749cebc185b4b70a0a077188cf17f
Recall: https://gyazo.com/4bbacf1ff196671a36d4c38220e25c22
As you can see, Weka has better results in the majority of cases. I was hoping you could enlighten me as to why. Am I doing something very wrong, or is there something else afoot?
Kind regards,
Alex
Best Answer
-
MartinLiebig
The operator Generate Weight (Stratification) does the trick.
~Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
What type of Validation are you using? Split? X-val?
Yes, I'm using the 10-fold X-Validation operator. What I've found is that Weka weights classes differently from RapidMiner. Whereas RapidMiner's default weight is 1 for every class, Weka weights each class by how often it occurs in the data set (from what I could understand, anyway): the more often a class occurs, the bigger its weight. This means the weighted averages of precision and recall in Weka are skewed when compared to RapidMiner's equal-weight approach.
Since I couldn't find a way to adjust the weights appropriately (because I'm too green, or otherwise), I've since done some manual spreadsheet work to normalize Weka's weights, and the results are much more comparable now.
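To make the difference concrete, here's a minimal sketch (not the original spreadsheet, and the confusion matrix is made-up example data) of per-class recall averaged two ways: equally per class, as RapidMiner reports it, versus weighted by class frequency, as Weka's "weighted avg." row does:

```python
# Toy confusion matrix: rows = true class, columns = predicted class.
# Classes A, B, C with 60, 30, and 10 examples respectively (assumed data).
confusion = [
    [50, 5, 5],   # class A
    [2, 25, 3],   # class B
    [1, 2, 7],    # class C
]

n_classes = len(confusion)
support = [sum(row) for row in confusion]   # examples per true class
total = sum(support)

recall = []
for c in range(n_classes):
    tp = confusion[c][c]                    # correctly classified examples of class c
    recall.append(tp / support[c])          # per-class recall

# RapidMiner-style: every class counts equally (macro average).
macro_recall = sum(recall) / n_classes

# Weka-style: each class weighted by how often it occurs in the set.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

With a skewed class distribution like this, the frequent classes dominate the weighted average, so the two numbers diverge even though the per-class recalls are identical; that is the skew described above, and precision averages behave the same way.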