
SMOTE

norita Member Posts: 29 Learner III
Hi


My binary classification problem is imbalanced (the outcome occurs in 5% of the cases).
I used SMOTE for the variable selection and for training the model.

SMOTE comes from the paper in the link above.

The paper states: "a combination with the method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance than only under-sampling the majority class."

My question now is: is applying SMOTE alone sufficient to address the imbalance problem, or do I need to add an additional operator for "under-sampling the majority class"?



Answers

  • yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @norita,

    If there are too many samples in the majority class, you can add down-sampling (with the "Sample" operator) before SMOTE. You may also use similarity analysis to identify similar data points in the majority class and reduce this population with simple filters. Some R/Python libraries offer more sophisticated under-sampling algorithms, e.g. Edited Nearest Neighbor Rule, Condensed Nearest Neighbor Rule, Tomek Links, One-Sided Selection, Neighborhood Cleaning Rule, ...

    Note that the ROC curve can be a misleading measure of classifier performance on imbalanced data. TPR depends only on the positives and FPR only on the negatives, so when negatives vastly outnumber positives, a large number of false positives changes the FPR (and therefore the ROC curve) very little. AUC does not place more emphasis on one class than the other, so it does not reflect performance on the minority class well. Try the Precision-Recall curve on imbalanced data.

    Cheers,
    YY
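To make the ROC versus precision point above concrete, here is a small worked example in plain Python. The counts are invented for illustration (10,000 cases at the 5% prevalence mentioned in the question, scored by a hypothetical classifier at one threshold); they are not results from any real model, and `rates` is just an illustrative helper name.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # recall: share of positives that are caught
    fpr = fp / (fp + tn)          # share of negatives wrongly flagged
    precision = tp / (tp + fp)    # share of flagged cases that are real positives
    return tpr, fpr, precision

# Assumed counts: 500 positives and 9,500 negatives (5% prevalence).
# The hypothetical classifier catches 400 positives and raises 950 false alarms.
tpr, fpr, precision = rates(tp=400, fp=950, fn=100, tn=8550)
print(f"TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.2f}")
# prints: TPR=0.80 FPR=0.10 precision=0.30
```

The ROC point (FPR 0.10, TPR 0.80) looks good, yet only about 30% of the flagged cases are true positives. The 950 false alarms are small relative to 9,500 negatives but large relative to 400 true positives, which is the effect a Precision-Recall curve exposes.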
  • norita Member Posts: 29 Learner III
    Thank you very much!
    I will have a closer look at the paper later. Thank you! Yes, performance measures are a very delicate topic, I think, especially in my case: internal validation with the SMOTE-manipulated data, followed by external validation with data at the original prevalence (5% vs. 95% for the two outcomes).

    One question about SMOTE still remains for me. I applied only SMOTE: I oversampled the minority class so that both outcomes have equal sizes for model development.
    Should I be concerned that I addressed the imbalance problem only with SMOTE, and not in combination with under-sampling the over-represented outcome?
    Am I right that the authors of the SMOTE paper only stated a positive effect when it is used in combination with under-sampling?

    Is it usual to apply only SMOTE to bring the comparison groups to equal size?
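For readers comparing the two options discussed in this thread, the combination (under-sample the majority class first, then over-sample the minority class) can be sketched in plain Python as below. This is a minimal illustration on made-up toy data, not the RapidMiner SMOTE operator or any library implementation. The helper names `undersample` and `smote_like`, the target size of 200 per class, and `k=5` are all assumptions chosen for the example.

```python
import random

def undersample(majority, keep, rng):
    """Randomly keep `keep` majority rows (sampling without replacement)."""
    return rng.sample(majority, keep)

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

rng = random.Random(0)
# Toy data at 5% prevalence: 950 majority rows, 50 minority rows, 2 features
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]

majority_small = undersample(majority, keep=200, rng=rng)
minority_big = minority + smote_like(minority, n_new=150, k=5, rng=rng)
print(len(majority_small), len(minority_big))
# prints: 200 200
```

Applying SMOTE alone would correspond to skipping `undersample` and generating 900 synthetic rows instead of 150. Which variant performs better depends on the data, so it is worth comparing both on validation data that keeps the original prevalence.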