[SOLVED] evaluation of resampled dataset
Hello everyone!
This is my first post, so first things first: this is a great piece of software, and you guys deserve a Nobel Prize for making it free for the community. THANK YOU.
Now, here is my question; it is probably a little stupid, but I want to be sure. I have an unbalanced dataset, and I overcome this by undersampling the majority class, applying weights, or otherwise balancing it for training. But I must evaluate the performance, by cross-validation or split validation, on the original unbalanced dataset, right? The balanced data is only for training, right?
Thank you in advance.
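As a minimal pure-Python sketch of the workflow being asked about (the helper `undersample_majority` is hypothetical, not a RapidMiner operator): split off the test set first at the original class distribution, then undersample only the training portion.

```python
import random

def undersample_majority(X, y, majority_label, seed=0):
    """Randomly drop majority-class rows so both classes have
    equal counts. Apply this to the TRAINING split only; the
    test split keeps the original, skewed distribution."""
    rng = random.Random(seed)
    maj = [i for i, lbl in enumerate(y) if lbl == majority_label]
    mino = [i for i, lbl in enumerate(y) if lbl != majority_label]
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy training split: 8 majority (0) and 2 minority (1) examples.
X_train = [[i] for i in range(10)]
y_train = [0] * 8 + [1] * 2
X_bal, y_bal = undersample_majority(X_train, y_train, majority_label=0)
print(y_bal.count(0), y_bal.count(1))  # 2 2 -- balanced for training
```

The model would then be fitted on `X_bal, y_bal` and evaluated on the untouched, original-distribution test set.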
Answers
For most classifiers it is indeed important to have a more or less balanced training set. Whether you should use the true distribution for testing depends on the measure: ROC plots, the AUC, and recall are independent of the class distribution, whereas accuracy and precision (and many other measures) depend highly on it. So to get a good estimate of those, you should use the true distribution for testing.
Best regards,
Marius
Thank you for the prompt and really helpful answer.
I would like to ask one more clarifying question, if you don't mind. I am just trying to understand why recall stays the same. Is it because a classifier trained on a balanced dataset can predict the minority class on an unbalanced test set equally well? It simply gets more "opportunities" to misclassify the majority class as the minority in a skewed dataset, and that is why precision for the minority class is lower?
Thank you.
Matus
Exactly right.
Best regards,
Marius
Best regards,
Matus