Model selection for imbalanced training dataset
Hi RapidMiner,
I'm doing model selection for an SVM using the "Optimize Parameters (Grid)" operator. My training dataset is imbalanced/skewed (782 positive examples and 2048 negative examples), so we cannot use Accuracy (= (TP+TN)/(TP+TN+FP+FN)) as the score for model selection: a predictor that labels everything as negative would already reach 2048/(2048+782) ≈ 72.4%. Is there a way to choose Precision and Recall, or a combined function of them such as the F1 score, instead of Accuracy? I looked through the parameter list of the Performance operator but could not find those scores. Or is there another way to deal with an imbalanced dataset like this?
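To make the arithmetic above concrete, here is a minimal Python sketch (plain Python, not RapidMiner code). The function name `binary_metrics` is hypothetical, and scoring an undefined precision (0/0) as 0 is an assumption made for illustration; tools differ on how they report that case.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumption: ratios with a zero denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# A predictor that labels all 2830 examples as negative:
# no true or false positives, 2048 true negatives, 782 false negatives.
all_negative = binary_metrics(tp=0, tn=2048, fp=0, fn=782)

for name, value in all_negative.items():
    print(f"{name}: {value:.4f}")
```

The accuracy looks respectable while recall and F1 are both zero, which is why F1 (or precision and recall separately) is the more informative score for this dataset.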
I attach my process file here. In this process, I use the "Optimize Parameters (Grid)" operator to find the SVM hyper-parameters that give the best cross-validation performance. This process works very well on a balanced training dataset; now I wonder how to modify it for an imbalanced one. Thank you very much for your help!
Best Answers
IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751
RM Founder
Sure - all those measurements (precision, recall, F1 and many more) are available as parameters of the operator "Performance (Binominal Classification)".
Hope this helps,
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635
Another option is to add weights to balance the classes, since the SVM operator accepts weights. But in either case you may want to look at AUC as a performance metric as well. It's my preferred one for classification problems, since it does not depend on a single arbitrary cutoff threshold.
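Both suggestions can be sketched in a few lines of plain Python (not RapidMiner code; the helper names `balanced_class_weights` and `pairwise_auc` are hypothetical). The weighting rule n / (k * n_c) is one common "balanced" heuristic and is only an assumption here, not necessarily what the SVM operator expects. The AUC is computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half; the toy scores are made up for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c) so every class has equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def pairwise_auc(scores, labels, positive=1):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Class sizes from the question: 782 positive, 2048 negative.
labels = ["positive"] * 782 + ["negative"] * 2048
weights = balanced_class_weights(labels)
print(weights)

# Toy confidence scores (invented for illustration), 1 = positive class.
toy_auc = pairwise_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(toy_auc)
```

Note that `pairwise_auc` never picks a cutoff: it only looks at how the scores rank positives against negatives, which is the threshold independence mentioned above.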
Thank you, Ingo,
I've already seen the scores in the "Performance (Binominal Classification)" operator!