Performance Measures for Imbalanced Data
Hi all!
My question is not directly about the program, but I know this community has many experienced data miners, so I hope to find the right answer here. I am doing decision tree classification and measuring both classification and binomial performance using different parameter combinations. I need to select one of the well-performing models to build a decision tree for detecting disease risk factors. I have read an article that says: "Any performance metric that uses values from both columns will be inherently sensitive to class skews." I take this to mean that if I have imbalanced data, I should not use those metrics. Could you please confirm my understanding?
Answers
Actually, it is not that you shouldn't use them, but these measures change when there is a class imbalance and can be misleading. Accuracy is a typical example.
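To make this concrete, here is a small sketch in plain Python. It assumes the quoted "columns" are the columns of a confusion matrix laid out with one actual class per column, and the counts below are made up for illustration, not taken from any real dataset. The same hypothetical classifier (80% of positives caught, 10% of negatives wrongly flagged) is scored once on balanced data and once on skewed data.

```python
def metrics(tp, fn, fp, tn):
    """Compute common metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # mixes both actual classes
        "precision": tp / (tp + fp),     # mixes both actual classes
        "recall": tp / (tp + fn),        # uses actual positives only
        "fpr": fp / (fp + tn),           # uses actual negatives only
    }

# Hypothetical counts: 100 positives vs 100 negatives.
balanced = metrics(tp=80, fn=20, fp=10, tn=90)
# Same per-class behaviour, but 100 positives vs 10,000 negatives.
skewed = metrics(tp=80, fn=20, fp=1000, tn=9000)

# A trivial model that always predicts "negative" on the skewed data.
always_negative_accuracy = (0 + 10000) / (100 + 10000)

for name in ("accuracy", "precision", "recall", "fpr"):
    print(f"{name:9s} balanced={balanced[name]:.3f} skewed={skewed[name]:.3f}")
print(f"always-negative accuracy on skewed data: {always_negative_accuracy:.3f}")
```

Recall and false positive rate stay the same because each one only looks at a single actual class, while accuracy and precision move with the class ratio. The always-negative model also shows why accuracy alone can look excellent on skewed data without finding a single positive case.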
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Dortmund, Germany
To add to @mschmitz's post, here is a Kaggle article that advises favoring AUPRC (Area Under the Precision-Recall Curve) as the performance metric of a model when the dataset is very imbalanced:
https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve
If you want to use the AUPRC (performance) operator in your process in RapidMiner, you have to install the free Operator Toolbox extension.
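Outside RapidMiner, the idea can be sketched in a few lines of plain Python. This is only an illustration on made-up scores: `average_precision` is a simplified estimate of AUPRC that assumes no tied scores, and neither function is meant to reproduce what the Operator Toolbox operator computes internally.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive occurs."""
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` examples
    return total / hits if hits else 0.0

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 20 examples, only 2 positives, ranked 3rd and 6th.
scores = [1 - i / 20 for i in range(20)]
labels = [0] * 20
labels[2] = labels[5] = 1

ap = average_precision(labels, scores)
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}, average precision = {ap:.3f}")
```

On this toy ranking the ROC AUC looks comfortable while the average precision is much lower, because the many negatives barely affect the false positive rate but quickly dilute precision.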
Regards,
Lionel
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts