
Model performance estimation

npapan69 Member Posts: 17 Maven
edited January 2019 in Help
Dear All,
I have a relatively small dataset with 130 samples and 2,150 attributes, and I want to build a classifier to predict 2 classes. Apparently, I need to reduce the number of attributes to avoid overfitting, so I could use e.g. RFE-SVM to bring the number of attributes down to one tenth of my sample count, which is 13. I'm using a Logistic Regression model, and I need to do some fine-tuning of parameters like lambda and alpha. After reading the very informative blog from Ingo, I would like some help with the practical implementation. May I kindly ask a more experienced member to check the following workflow? Can I trust this implementation, and in particular the performance estimates? Is it good practice to compare the performance from CV with that from a single hold-out set? And if yes, should these numbers be more or less the same? (A sketch follows this post.)



Many thanks in advance,

npapan69
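
A minimal sketch of the idea in Python/scikit-learn (a stand-in for the RapidMiner process, not the process itself): RFE with a linear SVM selects 13 attributes inside the pipeline, an inner cross-validation tunes the regularization parameters (l1_ratio and C play roles roughly analogous to alpha and lambda), and an outer cross-validation estimates performance. The data, parameter values, and grid ranges below are illustrative assumptions.

```python
# Illustrative sketch only: synthetic data standing in for the 130 x 2150 table.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=130, n_features=2150, n_informative=15,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # RFE-SVM: drop 10% of the remaining attributes per iteration until 13 are left
    ("rfe", RFE(LinearSVC(max_iter=10000), n_features_to_select=13, step=0.1)),
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, max_iter=10000)),
])

# Assumed tuning grid for the elastic-net parameters
param_grid = {"clf__l1_ratio": [0.0, 0.5, 1.0], "clf__C": [0.01, 0.1, 1.0]}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# The inner CV tunes the parameters; the outer CV estimates the performance of
# the whole procedure (selection + tuning), so the estimate is not optimistic.
tuned = GridSearchCV(pipe, param_grid, cv=inner, scoring="roc_auc")
scores = cross_val_score(tuned, X, y, cv=outer, scoring="roc_auc")
print(f"nested CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```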

Best Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted
    Cross-validation is generally believed to be more accurate than a simple split validation. Split validation measures performance on only one random sample of the data, whereas cross-validation uses all the data for validation. Think about it this way: the hold-out from a split validation is simply one of the k folds of a cross-validation. It's inherently inferior to taking multiple hold-outs and averaging their performance, which gives you not only a point estimate but also a sense of the variance of the model performance (a short scikit-learn sketch of this idea follows this reply).

    It's different if you have a totally separate dataset (sometimes called an "out of sample" validation, perhaps from a different set of users, or different time period, etc.) that you want to test your model on after the initial construction.  In that case your separate holdout might provide additional insight into your expected model performance on new data.  But in a straight comparison between split and cross validation, you should prefer cross validation.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
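
A minimal scikit-learn sketch of the point above (illustrative built-in dataset and model, not from the original thread): the k fold scores give both a point estimate (the mean) and a sense of the variance (the standard deviation), while a split validation is effectively a single fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset
model = LogisticRegression(max_iter=10000)

# Split validation: one number from one random hold-out, no idea how lucky it was
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print("split accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation: every row is used for testing exactly once,
# and the spread of the fold scores shows the variance of the estimate
scores = cross_val_score(model, X, y, cv=10)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```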
  • rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    edited November 2018 Solution Accepted
    On another note:

    Now that my sensei @Telcontar120 mentions it, you have two files: one is filename75.csv and the other is filename25.csv, right? (Say yes even if your files have different names.)

    If you did that because you want to replace the filename25.csv file with data coming from elsewhere, the process you wrote (and then the process I sent you) is fine. If you did the split because your goal is to prepare that model and perform a split validation after a cross validation, that's not really required. It's safe to treat Cross Validation as a better thought-out Split Validation (until science says otherwise, but that hasn't happened). In that case, your question:
    Should I trust one or the other?
    Be safe trusting the Cross Validation.

    In the case I sent, I assume that your testing data is new data that comes from outside your sample. A good example is what happened to me in my oceanic research project:
    • Trained my model with a portion of valid data from 2015 and 2016.
    • Tested my model with a portion of valid data from 2015 and 2016, but a different chunk of it.
    • Then I had data from 2009 to 2014 that was outside of my sample and that I wanted to score.
    My question is: should I use a new performance validator? 
    • If what I want to validate is how my algorithm behaves, then no, one validation is enough.
    • If what I want to validate is the way historical data has been scored, then yes, you might check whether your algorithm holds up against older data: one validator for the model and another for the old data after applying the model.
    • Everything else, no.
    So, rule of thumb: if what's important is the model, go with Cross Validation. If it's historical data that is also scored, perform that validation yourself. If it's new data, don't validate anything, because your new data only gets predicted labels, not true labels, and validation ALWAYS comes from data you already know. (A short sketch of these three cases follows this reply.)

    Hope this helps.
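
A minimal sketch of the three cases above in Python/pandas/scikit-learn (the file names, "label" column, and AUC metric are illustrative assumptions mirroring the oceanic example, not the original process):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Hypothetical files mirroring the example above
train = pd.read_csv("data_2015_2016.csv")     # labelled training sample
historic = pd.read_csv("data_2009_2014.csv")  # older data, labels known
new = pd.read_csv("data_new.csv")             # new data, no labels yet

features = [c for c in train.columns if c != "label"]  # assumed label column
model = LogisticRegression(max_iter=10000)

# 1) The model itself: trust the cross-validation estimate
cv_auc = cross_val_score(model, train[features], train["label"],
                         cv=10, scoring="roc_auc")
print(f"CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# 2) Historical data that also has labels: check the applied model yourself
model.fit(train[features], train["label"])
hist_auc = roc_auc_score(historic["label"],
                         model.predict_proba(historic[features])[:, 1])
print(f"AUC on the 2009-2014 data: {hist_auc:.3f}")

# 3) Genuinely new data: it can only be scored, there is nothing to validate
new["prediction"] = model.predict(new[features])
```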

Answers

  • Maerkli Member Posts: 84 Guru
    Rodrigo, it is brilliant!
    Maerkli
  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @Maerkli if you like, please use the new "reaction" tags: Promote, Insightful, Like, Vote Up, Awesome, LOL :smile:
  • npapan69 Member Posts: 17 Maven
    Dear Rodrigo,

    Thank you for taking the time to respond to my post in detail. Let me clarify: in the -omics sector (in which I'm working) it is very common to have far fewer samples (rows) than attributes, or features (columns). Therefore, various methods are used to narrow down to the few most informative features that will comprise the -omics signature. In the XML file you will see that apart from RFE I'm removing highly correlated features, as well as features with zero or near-zero variance (useless features). As a rule of thumb, one should have at least 10 samples for every feature that finally contributes to the model. So given the 130 samples available, I'm not supposed to exceed 13 features after the feature reduction techniques are applied. Actually, after watching Ingo's webinar, I will try the evolutionary feature selection techniques, keeping the maximum number of features at 13. Now the most important part for me is how to validate the model. In our field external validation is considered the most reliable technique; however, it's not very easy to get external data. So if I don't have external data, is it correct to start with a data split before doing anything else, keep 25% of the data as a hold-out test set, train and save my model, and afterwards test it with the hold-out set? Or should I forget about splitting and report (and trust) the CV results? Is there a way to do repeated cross-validation (100 times, for example)? (See the sketch after this reply.)

    Again many thanks for your time and greetings from Lisbon to the beautiful Chile.

    Nikos 
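
Regarding the repeated cross-validation question, a minimal scikit-learn sketch of 10-fold cross-validation repeated 100 times with different shuffles (illustrative data and model; in RapidMiner, looping over the Cross Validation operator with different random seeds would be the analogous approach, assuming that setup fits your process):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=130, n_features=50, random_state=0)  # illustrative

# 10-fold cross-validation repeated 100 times with different shuffles
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=10000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"repeated CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```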
  • npapan69 Member Posts: 17 Maven
    Many thanks, Rodrigo, for taking the time to answer my post in such a detailed way. In the -omics field in which I'm working, it's very common to have few samples and far too many attributes, so feature selection methods are very important to reduce overfitting. In my feature selection approach (as you will see in my process) I start by removing useless and highly correlated features and then apply RFE-SVM. As a rule of thumb, the number of features that finally comprise the model (signature) should not exceed one tenth of the number of samples used to train the model. Now the question is whether my approach, using a nested cross-validation operator to select features, train, and fine-tune the model on 75% of the samples while testing the performance on the 25% hold-out test set, is correct. And if yes, should the difference in my performance metrics (accuracy, AUC, etc.) between the CV output and my test data output be minimal? If not, is that a sign of overfitting? Should I trust one or the other? Should I verify the absence of overfitting by comparing the two outputs? (A sketch of this comparison follows this reply.)

    Nikos
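
A minimal scikit-learn sketch of that comparison (illustrative synthetic data and parameters): the feature selection sits inside the pipeline so it is re-run in every fold, the 75% training part is cross-validated, and the untouched 25% hold-out is scored once. Roughly similar numbers suggest the CV estimate is honest; a CV score far above the hold-out score can signal leakage or overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=130, n_features=500, n_informative=10,
                           random_state=0)  # illustrative stand-in data
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

# Feature selection lives inside the pipeline, so each CV fold repeats it
pipe = Pipeline([
    ("rfe", RFE(LinearSVC(max_iter=10000), n_features_to_select=13, step=0.1)),
    ("clf", LogisticRegression(max_iter=10000)),
])

cv_auc = cross_val_score(pipe, X_tr, y_tr, cv=10, scoring="roc_auc")
pipe.fit(X_tr, y_tr)
ho_auc = roc_auc_score(y_ho, pipe.predict_proba(X_ho)[:, 1])

print(f"CV AUC on the 75% training part: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
print(f"AUC on the 25% hold-out: {ho_auc:.3f}")
```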
  • npapan69 Member Posts: 17 Maven
    edited November 2018
    Again many thanks Rodrigo, for your enlightening answer, and the time devoted to correct my process.
     
    Best wishes,
    Nikos
  • npapan69 Member Posts: 17 Maven
    Dear Rodrigo,
    I must admit that I couldn't find a way to evaluate the training and test data variance by X-means. Probably this is very basic, and I apologise for that, but the X-means operator can receive only a single file as input, and I guess I have to provide 2 files as inputs (75% training, 25% testing). Any workarounds?

    Many thanks
    Nikos
  • rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hi @npapan69

    Sure, just use the Append operator to merge both files into a single one. Make sure the columns have the same names, and that's it. (A small pandas sketch of the same idea follows this reply.)

    All the best,

    Rodrigo.
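
A minimal pandas sketch of the same idea (scikit-learn has no X-Means, so plain k-means with a fixed k stands in; the file names come from this thread, and the "label" column is an assumption):

```python
import pandas as pd
from sklearn.cluster import KMeans

train = pd.read_csv("filename75.csv")  # file names taken from the thread
test = pd.read_csv("filename25.csv")

# Append: stack both tables and remember where each row came from
combined = pd.concat([train.assign(source="train"),
                      test.assign(source="test")], ignore_index=True)

# Assumes numeric attributes plus an (assumed) "label" column to drop
features = combined.drop(columns=["source", "label"], errors="ignore")
combined["cluster"] = KMeans(n_clusters=4, n_init=10,
                             random_state=0).fit_predict(features)

# If train and hold-out rows spread over the clusters in similar proportions,
# the two partitions cover similar regions of the data
print(pd.crosstab(combined["cluster"], combined["source"]))
```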