
Over-fitting problem

Ignacio Member Posts: 7 Contributor II
edited November 2018 in Help
Hi,
I'm working with 1000 attributes, 8000 examples and only 2.5% positive cases. To train the model, I used under-sampling (25% positive, 75% negative). First I optimized the model parameters. Then I ran a Forward Selection of variables followed by a Backward Selection, with different "keep best" values (1, 2, 5, 10). My performance on the training part is 0.824. When I test, the performance is 0.742. I'm always working with x-validation. I can't figure out where the over-fitting comes from. Am I using the correct sampling? Should I use over-sampling or a different under-sampling?
Thank you very much,
Ignacio

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi!

    Does your X-Val include the Forward Selection and the Optimization? Otherwise you can easily overfit (you just pick the attributes that happen to be good for this specific (sub)set).

    Could you maybe provide an example process doing this? What is the Std_dev for the 0.824?

    To improve performance, I would recommend using weights.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
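
The overfitting risk raised in this answer can be reproduced outside RapidMiner. Below is a minimal Python sketch, not the poster's process: the data are pure noise, and the helper names (`make_noise_data`, `top_features`, `centroid_accuracy`, `cross_validate`) and the nearest-centroid classifier are assumptions chosen only to keep the example dependency-free. It compares selecting attributes once on all rows (outside the validation) with selecting them again inside every fold.

```python
import random

def make_noise_data(n_rows, n_cols, seed=0):
    """Gaussian noise attributes with balanced labels that are unrelated to them."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [i % 2 for i in range(n_rows)]
    return X, y

def top_features(X, y, rows, k):
    """Rank columns by |mean(class 1) - mean(class 0)|, using only `rows`."""
    pos = [r for r in rows if y[r] == 1]
    neg = [r for r in rows if y[r] == 0]
    def gap(j):
        return abs(sum(X[r][j] for r in pos) / len(pos)
                   - sum(X[r][j] for r in neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_accuracy(X, y, train, test, cols):
    """Nearest-centroid classifier on `cols`: fit on `train`, score on `test`."""
    centroids = {}
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        centroids[label] = [sum(X[r][j] for r in members) / len(members)
                            for j in cols]
    hits = 0
    for r in test:
        dist = {label: sum((X[r][j] - c) ** 2 for j, c in zip(cols, cen))
                for label, cen in centroids.items()}
        hits += int(min(dist, key=dist.get) == y[r])
    return hits / len(test)

def cross_validate(X, y, k_folds, n_keep, select_inside):
    """Mean fold accuracy; the selection either sees all rows or only the training fold."""
    rows = list(range(len(X)))
    folds = [rows[i::k_folds] for i in range(k_folds)]
    global_cols = top_features(X, y, rows, n_keep)  # uses test rows too: leaks
    scores = []
    for test in folds:
        train = [r for r in rows if r not in test]
        cols = top_features(X, y, train, n_keep) if select_inside else global_cols
        scores.append(centroid_accuracy(X, y, train, test, cols))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    X, y = make_noise_data(40, 1000, seed=0)
    print("selection outside X-Val:", cross_validate(X, y, 5, 10, False))
    print("selection inside X-Val: ", cross_validate(X, y, 5, 10, True))
```

Because the labels carry no signal, an honest estimate should sit near chance. The variant that selects outside the validation typically reports a much higher accuracy, which is the overestimate being warned about.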
  • Ignacio Member Posts: 7 Contributor II
    Yes. The std_dev is 0.012.

    I am using weights. I think the problem might be in the sampling process.

    Is over-sampling a good idea?

    Ignacio
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    Why do you want to sample if you use weights? It has a very similar effect. Are you sure that this does not change your performance in training and testing?
    And again: are your Feature Selection and Optimization inside your X-Val? Otherwise you will overestimate your performance.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
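
To make the "very similar effect" of weights and sampling concrete, here is a small arithmetic sketch in Python. The counts (200 positives, 7800 negatives) are an assumption derived from the 8000 examples and 2.5% positives mentioned in the question, and the two function names are hypothetical. One function gives the per-row weight for positives that reaches a target class share, the other gives how many negatives under-sampling would keep for the same share.

```python
def balancing_weight(n_pos, n_neg, target_pos_share):
    """Weight per positive row (negatives keep weight 1) so that positives
    carry `target_pos_share` of the total weight."""
    return target_pos_share * n_neg / ((1.0 - target_pos_share) * n_pos)

def negatives_to_keep(n_pos, target_pos_share):
    """Number of negatives to keep when under-sampling to the same share."""
    return round(n_pos * (1.0 - target_pos_share) / target_pos_share)

if __name__ == "__main__":
    n_pos, n_neg = 200, 7800  # assumed: 2.5% of 8000 examples are positive
    for share in (0.25, 0.5):
        print(share,
              balancing_weight(n_pos, n_neg, share),
              negatives_to_keep(n_pos, share))
```

Both routes reach the same class balance. Weighting keeps every negative example, while under-sampling discards most of them.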
  • Ignacio Member Posts: 7 Contributor II
    Hi,

    I'm sorry, I mixed up the terms: I used sampling with 75% for training and 25% for testing. The under-sampling I did was 50/50 positive/negative cases for training; testing was left at 2.5/97.5.

    First I got parameters for an SVM using the top 100 correlated attributes, then I used those parameters for a forward+backward selection. Two different processes. In both cases the X-Val was INSIDE the Optimize Parameters / Forward Selection. Are you saying it should be the other way around, with the optimizers inside a single X-Val node? What would each fold be tested against then, the same training fold? Or should I do an X-Val inside as well?

    I haven't tried weighting, but I read that it doesn't work with every algorithm in RM. I am using 5.3.015; which algorithms should I try it with? I normally use SVM, LibSVM, neural net, k-NN, Bayes, decision trees, logistic and linear regression.

    Thank you!

    Ignacio
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi Ignacio,

    Your procedure might lead to overfitting. You might choose the attributes (= dimensions) that are well suited to your particular subset of data. Think about binominal attributes coding whether a customer lives in a given city or not. If you optimize on that, you can end up fitting "People from Springfield", which is overtraining.

    To do it correctly you need:

    an outer X-Val; inside it, Optimize Parameters; inside that, Feature Selection with its own inner X-Val.

    This takes a lot of time. So if you have enough data you might do the Feature Selection on a "hold-out" set, which is then not used in the Optimization anymore.

    For the weighting: you can click on an operator and then press F1 to see what's supported. There is an entry for weights. From a first look, your operators should support weights.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
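
The nesting described in this answer (an outer validation wrapped around the optimization, which in turn uses its own inner validation) can be sketched in Python as follows. This is an illustration under stated assumptions, not a RapidMiner process: `k_fold_splits` and `nested_cv` are hypothetical names, and `fit_and_score` is a placeholder callback standing for "run feature selection and train with this parameter on the training rows, then score on the test rows".

```python
def k_fold_splits(rows, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [rows[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

def nested_cv(n_rows, outer_k, inner_k, candidates, fit_and_score):
    """Outer loop estimates performance; inner loop picks the best candidate.

    The outer test fold is never passed to the inner loop, so it plays no
    part in choosing parameters or attributes.
    """
    rows = list(range(n_rows))
    outer_scores = []
    for outer_train, outer_test in k_fold_splits(rows, outer_k):
        def inner_score(candidate):
            scores = [fit_and_score(train, test, candidate)
                      for train, test in k_fold_splits(outer_train, inner_k)]
            return sum(scores) / len(scores)
        best = max(candidates, key=inner_score)  # chosen on outer_train only
        outer_scores.append(fit_and_score(outer_train, outer_test, best))
    return sum(outer_scores) / len(outer_scores)

if __name__ == "__main__":
    # Dummy callback: pretends the score equals the candidate value.
    print(nested_cv(30, 5, 3, [0.1, 0.7], lambda train, test, c: c))
```

The cost is visible in the call count: with 5 outer folds, 3 inner folds and 2 candidates, the callback runs 5 × (3 × 2 + 1) = 35 times, which is why this setup takes so long on real data.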