Predictive model for rare occurrences
Hi fellow RapidMiners,
What kind of model would you suggest I should look into when trying to predict a binary outcome with a very high class imbalance (97/3)? The problem at hand is medical readmission within 30 days for surgery. Any suggestions would be appreciated. Currently I am planning to test the k-NN algorithm looping through different k-values.
Best regards.
Best Answers
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Casper72,
In your case, I advise you to preprocess your data by upsampling your dataset before modelling.
For that, you can use the SMOTE Upsampling operator from the Operator Toolbox extension available for free in the MarketPlace.
Regards,
Lionel
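For readers who want to see what SMOTE-style upsampling does under the hood, here is a minimal sketch in plain Python. It is not the Operator Toolbox implementation; the function name `smote`, its parameters, and the toy data are all hypothetical and only illustrate the core idea of interpolating between a minority example and one of its nearest minority neighbours.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority neighbours.

    Hypothetical helper for illustration; not the RapidMiner operator.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority examples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy data (made up): 97 majority rows, 3 minority rows, two numeric features.
data_rng = random.Random(1)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(97)]
minority = [(3.0, 3.0), (3.5, 2.5), (4.0, 3.5)]

# Generate enough synthetic minority rows to match the majority count.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Whichever tool is used, the upsampling should be applied to the training data only (inside the cross-validation), so that synthetic rows never end up in the test folds.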
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
You can try weighting instead to balance the classes (although not all ML algorithms support weighting). This might give better results than upsampling with such a small minority class.
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Take a look at the tutorial for the Generate Weight (Stratification) operator; that should be the one to use.
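To make the weighting idea concrete, here is a small plain-Python sketch of stratified weights for a 97/3 split. It assumes the goal is for every class to carry the same total weight; whether Generate Weight (Stratification) computes exactly these values is an assumption, and the function name `stratified_weights` and its `total_weight` parameter are hypothetical.

```python
from collections import Counter

def stratified_weights(labels, total_weight=None):
    """Give each example a weight so that every class sums to the same total.

    Hypothetical helper for illustration; not the RapidMiner operator.
    """
    counts = Counter(labels)
    if total_weight is None:
        total_weight = len(labels)  # keep the overall weight equal to the row count
    per_class = total_weight / len(counts)
    return [per_class / counts[y] for y in labels]

# 97/3 imbalance as in the question (labels are made up).
labels = ["no"] * 97 + ["yes"] * 3
weights = stratified_weights(labels)

# Sum the weights per class to check that both classes are balanced.
by_class = Counter()
for y, w in zip(labels, weights):
    by_class[y] += w
print(dict(by_class))
```

Each minority row ends up weighted roughly 32 times as heavily as a majority row (97/3), which only helps if the learner actually uses example weights.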
Answers
I will try using SMOTE. I have used it before with success, although with less imbalanced datasets (typically in the range of 30/70).