How is data split into training / test sets in RapidMiner Go?

cramsden Member Posts: 42 Learner III
I am using RapidMiner Go and would like to know how data is split into training / test sets, and whether it's the same for each method (deep learning, gradient boosted trees, etc.).

Is there anywhere in the docs that says the first 70% of rows are used for training, or something like this?

Thank you

Answers

  • alebo Employee-RapidMiner, Member Posts: 15 RM Product Management
    Hi Chris, 
    We use a 60/40 split for every model. If the target column is nominal, Go builds random subsets while ensuring that the distribution of target values matches the original dataset (a stratified split). Otherwise, Go simply splits the rows at random.
    Regards,
    Andras
  • cramsden Member Posts: 42 Learner III
    Thank you,

    Is there any more information on this? I am new to data science and self-taught, so I'm a bit confused by the terminology.

    I am asking because I am noticing a difference in the predictive power of my models depending on the order the rows were in when the data set was originally uploaded.

    To clarify: the data is split 60/40, and the rows that go into the 60% and the 40% are chosen randomly while keeping the same distribution?

    Or is it the first 60% of rows and last 40% of rows for the split?


  • alebo Employee-RapidMiner, Member Posts: 15 RM Product Management
    You can find free learning materials on https://academy.rapidminer.com/. I would recommend checking it out.
    For example, on validation I found the following materials that could help you:
    Unfortunately, these are mostly focused on RapidMiner Studio processes, but Go uses the same underlying data science practices. You might want to experiment with Studio as well, as it gives you a lot more flexibility than Go.

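    When splitting data, Go always shuffles the dataset, so the split is never simply the first 60% and last 40% of rows. In the case of a nominal (categorical) label, Go also ensures that both subsets keep the same label distribution as the original data.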

    Regards,
    Andras
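
For readers who want to see what a shuffled, distribution-preserving 60/40 split looks like in practice, here is a minimal sketch in plain Python. It is not Go's actual implementation, which is not shown in this thread. The function name `split_60_40`, the `label_of` parameter, the fixed seed, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def split_60_40(rows, label_of=None, train_fraction=0.6, seed=42):
    """Shuffle rows and split them into (train, test).

    Hypothetical helper for illustration only. If label_of is given, the
    split is stratified: each label value is split separately so both
    subsets keep roughly the original label distribution.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable

    if label_of is None:
        # Numeric target or no label: plain random split.
        shuffled = list(rows)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # Nominal label: group rows by label value, then split each group.
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])

    # Shuffle again so rows are not ordered by label within each subset.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data: 100 rows, 30% labelled "yes" and 70% labelled "no".
rows = [(i, "yes" if i < 30 else "no") for i in range(100)]
train, test = split_60_40(rows, label_of=lambda r: r[1])

print(len(train), len(test))                 # 60 40
print(sum(r[1] == "yes" for r in train))     # 18, i.e. 30% of 60
print(sum(r[1] == "yes" for r in test))      # 12, i.e. 30% of 40
```

Because the rows are shuffled before splitting, a split like this does not depend on the first or last rows of the file. Which rows land in each subset still depends on the random seed, and how a given tool seeds its shuffle is something the sketch does not model.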