How to run a prediction model on a dataset without splitting it into train and test datasets

mansour Member Posts: 26 Contributor II
Hello everyone,

I hope this message finds you well. I am currently working on a project that involves running RapidMiner prediction models on a dataset. Specifically, I am interested in using tree induction, SVM, DM, and other models to predict outcomes and determine prediction accuracy. 

However, I am faced with a challenge in that my dataset only contains 60 samples, which makes it difficult to split it into training and testing datasets. Therefore, I am reaching out to you to see if anyone has any suggestions on how I can proceed with running the models without having to split the dataset.

I greatly appreciate any insights or advice you may have on this matter.

Thank you,
Mansour

Answers

  • rjones13 Member Posts: 204 Unicorn
    Hi Mansour,

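    I would use cross-validation instead of a single train/test split. It is an iterative technique that divides the data into folds, trains on all but one fold, tests on the held-out fold, and repeats until every fold has been used for testing. That way the entire dataset is used for both training and testing.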

    I would suggest starting with 5 folds, given the dataset size you mentioned, and adjusting from there. Here’s a good blog post on cross-validation where you can get a little more information: https://rapidminer.com/blog/validate-models-cross-validation/. I believe we also have a video on the Academy.

    Best,
    Roland
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi Mansour,

    You could also look into "leave one out" validation. This is a cross-validation with as many folds as there are data rows, in your case 60.

    The Cross Validation operator has a parameter for switching this on.

    This approach takes the first example as the test set and the rest of the data for training, then the second one, and so on. With this method, each example is tested with a model built on the rest of the data, and you get a robust estimate of the model quality.

    A final model will be built on all data if you connect the model output of Cross Validation.

    Regards,
    Balázs
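
For readers who want to see the mechanics behind both answers, here is a minimal plain-Python sketch of 5-fold cross-validation and leave-one-out validation on 60 rows. It is not RapidMiner code and does not reproduce the Cross Validation operator. The data is synthetic, the nearest-centroid learner is only a stand-in for tree induction, SVM, and so on, and all function names (`k_fold_indices`, `cross_validate`, and the rest) are hypothetical names chosen for this illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle the row indices and deal them into n_folds test folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_nearest_centroid(rows, labels):
    """Toy learner (assumption): store the mean feature vector per class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        centroids[label] = [mean(column) for column in zip(*members)]
    return centroids


def predict_nearest_centroid(centroids, row):
    """Predict the class whose centroid has the smallest squared distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def cross_validate(rows, labels, n_folds, seed=0):
    """Accuracy where each row is predicted once by a model that never saw it."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), n_folds, seed):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        model = fit_nearest_centroid(train_rows, train_labels)
        correct += sum(
            predict_nearest_centroid(model, rows[i]) == labels[i] for i in test_idx
        )
    return correct / len(rows)


# Synthetic stand-in for a 60-row dataset: two classes, two numeric attributes.
rng = random.Random(42)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
rows += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)]
labels = ["a"] * 30 + ["b"] * 30

acc_5fold = cross_validate(rows, labels, n_folds=5)        # 5 folds of 12 rows
acc_loo = cross_validate(rows, labels, n_folds=len(rows))  # leave one out: 60 folds of 1 row

# The validation runs only estimate quality; the final model is fit on all rows.
final_model = fit_nearest_centroid(rows, labels)

print(f"5-fold accuracy:        {acc_5fold:.3f}")
print(f"leave-one-out accuracy: {acc_loo:.3f}")
```

Leave-one-out is just the special case where the number of folds equals the number of rows, which is why the same `cross_validate` function covers both. The last step mirrors the point about the model output: the per-fold models are discarded and the model you keep is trained on the full dataset.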