Criterion for overfitting evaluation

Hung_Bui_221 Member Posts: 5 Learner I
Hello everyone, and have a nice day. I am having some trouble with overfitting. I have searched the RM Community and other websites, and they say that if the accuracy is greater than 90%, I am most probably facing overfitting. My case is below:

I have the datasets like this:

Then I created the process using classification (decision tree) with the bank-additional-full.csv as training data and bank-additional.csv as test data. After running, the accuracy is about 97% (and the correlation is about 79%).

I think this is overfitting. Is that correct? If so, how can I fix the problem? And is accuracy the only measure for evaluating overfitting? Please help me. Thank you.  o:)

Best Answers

  • MarcoBarradas Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Solution Accepted
    Hi @Hung_Bui_221

    These videos might help clarify things. You may have an accuracy of 97%, but what is the recall on the class you are trying to predict?

    https://academy.rapidminer.com/learn/article/overfitting-outliers


    Introduction to Performance Measurement
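
To make the point about recall concrete, here is a minimal Python sketch. The labels are made up for illustration (they are not the bank-marketing data), and the helper names `accuracy` and `recall` are hypothetical. It shows how a model can reach 97% accuracy on an imbalanced label while finding only 2 of 5 positive cases.

```python
# Toy illustration only: the labels below are invented, not taken from the thread's data.

def accuracy(y_true, y_pred):
    """Share of all examples that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive="yes"):
    """Share of the actual positives that the model found."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# 100 examples, only 5 of them positive ("yes").
y_true = ["yes"] * 5 + ["no"] * 95
# A model that finds 2 of the 5 positives and gets every negative right.
y_pred = ["yes"] * 2 + ["no"] * 3 + ["no"] * 95

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f}")  # accuracy=0.97 recall=0.40
```

On a label where the positive class is rare, the accuracy is dominated by the majority class, so it is worth checking recall (and precision) per class before judging the model.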

  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi!

    Just having a high accuracy doesn't mean that you have overfitting. You could also have a good model.

    Look at the decision tree and try different pruning parameter settings to control the possibility of overfitting. You'll be able to see if the tree is getting very complex and making nonsensical decisions (like "if first name = Peter then Label") or not. 

    An overfitted model doesn't work well on new data. Therefore, you just need to make sure that you verify correctly. See these videos in the Academy:
    https://academy.rapidminer.com/learn/video/validating-a-model
    https://academy.rapidminer.com/learn/video/optimization-of-the-model-parameters

    Regards,
    Balázs
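
The idea that an overfitted model looks good on training data but not on new data can be sketched in a few lines of Python. This is a toy example under stated assumptions: the rows, the attribute names, and the helper `fit_lookup` are all hypothetical, and the "model" is just a majority-label lookup on one attribute, not a RapidMiner decision tree. A rule keyed on the first name memorizes the training rows, while a rule keyed on a meaningful attribute generalizes.

```python
from collections import Counter, defaultdict

def fit_lookup(rows, key_index):
    """Learn the majority label for each value of one attribute."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        label = row[-1]
        by_value[row[key_index]][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]  # used for values never seen in training
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda row: table.get(row[key_index], fallback)

def accuracy(predict, rows):
    return sum(predict(r) == r[-1] for r in rows) / len(rows)

# Hypothetical rows: (first_name, age_group, label)
train = [("Peter", "young", "yes"), ("Anna", "young", "yes"), ("Marc", "old", "no"),
         ("Lena", "old", "no"), ("Omar", "young", "no"), ("Ines", "old", "no")]
test = [("Peter", "old", "no"), ("Anna", "old", "no"), ("Marc", "young", "yes"),
        ("Lena", "young", "yes"), ("Zoe", "young", "yes"), ("Omar", "old", "no")]

by_name = fit_lookup(train, key_index=0)  # "if first name = Peter then yes"
by_age = fit_lookup(train, key_index=1)   # rule on a meaningful attribute

for name, model in [("by_name", by_name), ("by_age", by_age)]:
    print(name, "train:", round(accuracy(model, train), 2),
          "test:", round(accuracy(model, test), 2))
```

The name-based rule is perfect on the training rows and poor on the held-out rows; that gap between training and validation performance, rather than a high accuracy by itself, is the signal to look for.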
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi!

    Bagging and other ensemble methods can help reduce overfitting and make models more robust. When you obtain 10 trees in a bagging model, those 10 trees together are the model; you do not choose one of them. The ensemble is probably as good as or better than a single tree. 

    With tree based methods, correlation is not that big of a problem. When one attribute is selected for a split, correlated attributes don't really matter. 

    If you suspect that correlation among polynominal attributes might hurt your model, you should validate that assumption. A good way to test this is to use Nominal to Numerical, which re-codes the nominal attribute values as new 0/1 attributes. You could then apply similar correlation-based filters. 

    Regards,
    Balázs
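
A rough Python sketch of why the 10 trees are used together: each base model is trained on a bootstrap sample, and the ensemble predicts by majority vote. This is an illustration under assumptions, not RapidMiner's Bagging operator; the base learner here is a one-threshold "stump" on a single numeric attribute, and the data and function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold on x that misclassifies the fewest rows of the sample."""
    best_errors, best_threshold = None, None
    for candidate in sorted({x for x, _ in sample}):
        errors = sum(("yes" if x >= candidate else "no") != y for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, candidate
    return lambda x: "yes" if x >= best_threshold else "no"

def fit_bagging(data, n_models=10, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        models.append(fit_stump(sample))
    return models

def predict_bagging(models, x):
    """The ensemble's answer is the majority vote of all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data with one noisy row at x=5.
data = [(1, "no"), (2, "no"), (3, "no"), (4, "yes"),
        (5, "no"), (6, "yes"), (7, "yes"), (8, "yes")]

models = fit_bagging(data)
print(len(models), predict_bagging(models, 0), predict_bagging(models, 100))
```

Because each stump sees a slightly different sample, individual models may disagree near the noisy row, and the vote averages those disagreements out.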

Answers

  • Hung_Bui_221 Member Posts: 5 Learner I
    Hi @BalazsBarany @MarcoBarradas . Thank you for helping me. I now understand the overfitting issue better. Here is my result after running the process:



    Besides, I have 2 more questions:

    1. In Optimize, I use Bagging (with a Decision Tree inside) because, as far as I know, this is also a way to reduce overfitting. Is that correct? After running it, I obtained 10 trees. How can I know which tree should be chosen?

    2. As far as I know, highly correlated attributes should be removed. So I used Weight by Correlation for the numerical and binominal attributes and then removed those with a correlation greater than 0.95. But what about polynominal attributes? At first I used Weight by Information Gain and Select by Weights for them. Then I got confused and switched to a Correlation Matrix for all attributes, as in the image above. What should I do in this case?

    Sorry for the long post, and thank you again for considering my questions. 
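
For the polynominal-attribute question, the 0/1 re-coding suggested in the accepted answer can be sketched in plain Python. This is only an illustration of the idea behind Nominal to Numerical followed by a correlation filter; the column values, the `is_student` flag, and the helpers `one_hot` and `pearson` are hypothetical. The 0.95 cutoff is the one mentioned in the question.

```python
from math import sqrt

def one_hot(name, values):
    """Re-code a nominal column as one 0/1 column per distinct value."""
    return {f"{name}={level}": [1 if v == level else 0 for v in values]
            for level in sorted(set(values))}

def pearson(a, b):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical columns: a polynominal attribute and a numeric flag that duplicates part of it.
job = ["admin", "student", "retired", "admin", "student", "retired"]
is_student = [0, 1, 0, 0, 1, 0]

dummies = one_hot("job", job)
correlations = {col: pearson(vals, is_student) for col, vals in dummies.items()}

# Flag dummy columns whose absolute correlation with the numeric flag exceeds 0.95.
flagged = [col for col, r in correlations.items() if abs(r) > 0.95]
print(flagged)  # ['job=student']
```

Once the nominal values are 0/1 columns, they can go through the same correlation-based filter as the numerical attributes, which makes it possible to check whether removing the redundant ones changes the validated performance at all.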