Model selection for imbalanced training dataset

phivu Member Posts: 34 Maven
edited November 2018 in Help

Hi RapidMiner,

I'm doing model selection for an SVM using the "Optimize Parameters (Grid)" operator. My training dataset is imbalanced/skewed (782 positive examples and 2048 negative examples), so Accuracy (= (TP+TN)/(TP+TN+FP+FN)) is not a suitable score for model selection: a predictor that labels everything as negative already reaches 2048/(2048+782) = 72.4% accuracy. Is there a way to use Precision and Recall, or a combination of them such as the F1 score, instead of Accuracy? I looked through the parameter list of the Performance operator but could not find those scores. Or is there another way to deal with an imbalanced dataset like this?

I attach my process file here. In this process, I use the "Optimize Parameters (Grid)" operator to find the SVM hyper-parameters that give the best cross-validation performance. This process works very well on a balanced training dataset; now I wonder how to modify it for an imbalanced one. Thank you very much for your help!
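
To make the concern about Accuracy concrete, here is a minimal Python sketch (not a RapidMiner process) that computes the four scores from confusion-matrix counts for a predictor that labels every example negative on the 782/2048 split described above. The function name `binary_metrics` and the convention of reporting 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominator is zero;
    # this sketch reports 0.0 in that case (an assumed convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Every example is predicted negative: no true or false positives.
all_negative = binary_metrics(tp=0, tn=2048, fp=0, fn=782)
print(all_negative)
```

Accuracy looks respectable for this trivial predictor, while recall and F1 are zero, which is why F1 (or precision and recall separately) is the more informative selection criterion here.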

 

 

 

Best Answers

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

    Sure - all those measurements (precision, recall, F1 and many more) are available as parameters of the operator "Performance (Binominal Classification)".

    Hope this helps,

    Ingo

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Another option is to add weights to balance the classes, since the SVM operator accepts weights. In either case, you may also want to look at AUC as a performance metric. It's my preferred one for classification problems because it does not depend on a single arbitrary cutoff threshold.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
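
The two ideas in this answer can be sketched in plain Python. This is only an illustration under stated assumptions, not how RapidMiner computes either quantity: the weighting rule (total / (number of classes × class count)) is one common heuristic, the helper names `balanced_class_weights` and `auc` are hypothetical, and the toy scores are made up for the example.

```python
def balanced_class_weights(counts):
    """Weight each class by n_total / (n_classes * n_class)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. No cutoff threshold is involved:
    only the ordering of the scores matters.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Class counts from the question; after weighting, both classes carry
# the same total weight.
weights = balanced_class_weights({"positive": 782, "negative": 2048})
print(weights)

# Made-up confidence scores for three positives and four negatives.
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.7, 0.3, 0.2, 0.1]
toy_auc = auc(toy_pos, toy_neg)
print(toy_auc)
```

Because `auc` only compares pairs of scores, rescaling or shifting all scores leaves it unchanged, which is the sense in which it does not depend on a single cutoff.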

Answers

  • phivu Member Posts: 34 Maven

    Thank you Ingo,

    I've now found the scores in the "Performance (Binominal Classification)" operator!
