Model selection for imbalanced training dataset

phivu · December 2016

Hi RapidMiner,

I'm doing model selection for an SVM using the "Optimize Parameters (Grid)" operator. My training dataset is imbalanced (782 positive examples and 2048 negative examples), so Accuracy (= (TP+TN)/(TP+TN+FP+FN)) is not a good score for model selection: a predictor that labels everything as negative already reaches 2048/(2048+782) ≈ 72.4%. Is there a way to select on Precision and Recall, or on a combination of them such as the F1 score, instead of Accuracy? I looked through the parameter list of the Performance operator but could not find those scores. Or is there another way to deal with an imbalanced dataset like this?
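To make the concern above concrete, here is a minimal Python sketch (not RapidMiner code) that computes accuracy, precision, recall and F1 from confusion-matrix counts for a predictor that labels every example as negative. The function name `binary_metrics` is made up for illustration, and reporting an undefined ratio (0/0) as 0 is an assumed convention.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Assumed convention: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Class counts from the post: 782 positive, 2048 negative.
# A predictor that answers "negative" for everything has TP = FP = 0.
all_negative = binary_metrics(tp=0, fp=0, tn=2048, fn=782)
print(all_negative)
```

Accuracy comes out near 0.72 while recall and F1 are both 0, which is why a score such as F1 is more informative than accuracy for selecting a model on this dataset.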

I attach my process file here. In this process, I use the "Optimize Parameters (Grid)" operator to find the SVM hyper-parameters that give the best cross-validation performance. The process works very well on a balanced training dataset; now I wonder how to modify it for an imbalanced one. Thank you very much for your help!

IngoRM · December 2016

Hi,

Sure - all those measurements (precision, recall, F1 and many more) are available as parameters of the operator "Performance (Binominal Classification)".

Hope this helps,

Ingo

Telcontar120 · December 2016

Another option is to add weights to balance the classes, since the SVM operator accepts weights. In either case, you may want to look at AUC as a performance metric as well. It's my preferred one for classification problems, since it does not depend on a single arbitrary cutoff threshold.
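Both suggestions can be illustrated with a small Python sketch (again not RapidMiner code). The weighting rule used here, n_samples / (n_classes * class_count), is one common heuristic and is an assumption, not necessarily what any RapidMiner operator does. The AUC function uses the pairwise-ranking definition. The helper names and the toy scores are hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class counts from the original post.
weights = balanced_class_weights({"positive": 782, "negative": 2048})
print(weights)

# Made-up toy labels and confidence scores, for illustration only.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

With these weights each class contributes the same total weight, and the AUC is computed from the ranking of the scores alone, so no cutoff threshold has to be chosen.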

phivu · December 2016

Thank you Ingo,

I've already seen the scores in the "Performance (Binominal Classification)" operator!
