Ideal ratio with respect to scoring dataset and training dataset
Like the 70-30 ratio for training and testing, is there a suggested ratio between the training dataset and the scoring dataset?
(This is so I can reduce the training data to the right proportion for the best scoring results.)
Best Answers
varunm1 Member Posts: 1,207 Unicorn
Hello @Abi
70-30 is a general ratio that you find in many processes where split validation is used. I really like the validation used in Auto Model. What Auto Model does is train a model on 60% of the data and then score it on the remaining 40%. It scores that 40% by splitting it into 7 subsets, testing on each subset, and then averaging the performance across the 7 subsets. This way it also gets the advantages of cross-validation by splitting into subsets.
My suggestion: go with 60% for training (cross-validate it) and 40% for testing (divided into 7 or 5 subsets) for scoring. If you can cross-validate the whole dataset, that is fine as well, but test the model on at least a 10% hold-out after CV. A rough sketch of this scheme is below.
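Here is a rough sketch of that 60/40 scheme in Python with scikit-learn (an assumption on my part, since this thread is about RapidMiner; the generated dataset, logistic regression model, and accuracy metric are purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Illustrative data; in practice this is your own ExampleSet / dataframe.
X, y = make_classification(n_samples=1000, random_state=42)

# 60% for training, 40% held out for scoring.
X_train, X_score, y_train, y_score = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)

# Cross-validate on the 60% training portion.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print(f"10-fold CV accuracy on the training portion: {cv_scores.mean():.3f}")

# Fit on the full training portion, then score the 40% hold-out
# in 7 disjoint subsets and average, mimicking Auto Model.
model.fit(X_train, y_train)
subset_scores = [
    accuracy_score(y_score[idx], model.predict(X_score[idx]))
    for _, idx in KFold(n_splits=7, shuffle=True, random_state=42).split(X_score)
]
print(f"Mean accuracy over the 7 scoring subsets: {np.mean(subset_scores):.3f}")
```

The spread of the 7 subset scores also gives you a rough sense of how stable the performance estimate is, which a single 60/40 split cannot.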
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
The ideal ratio is to use cross-validation. There is a reason this is considered the "gold standard" for validation. This approach ensures that 100% of the data is used in both training and testing. Otherwise you are inviting bias from random effects of which records land in your training set vs. your testing set.
I understand the reasons why Auto Model has chosen to implement a form of split validation, which is primarily to save processing time. That is probably a smart choice for an automated tool designed to work on pretty much any size of data set that users might throw at it. It is also potentially doing a lot of other complicated things, like feature engineering and feature selection, so some corners have to be cut to make the best use of the overall time users are willing to wait for the output.
However, if you are building your own process manually and can set it up any way you like, then your default should probably be to do cross-validation, and only deviate from that when you have a specific need. If you have tons of data and are also doing many other complicated things, then perhaps it is better to do split validation. But if you have smaller data sets, or more time you can devote to preprocessing and modeling, then cross-validation is really the way to go; a minimal sketch follows.
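For reference, a minimal sketch of plain k-fold cross-validation, again assuming Python with scikit-learn purely for illustration (the thread itself concerns RapidMiner operators):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Every row is used for training in k-1 folds and for testing in exactly
# one fold, which is what avoids the sampling bias of a single split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```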
Answers
Scoring is typically real-time rather than batch, so I assume you mean the ratio between the train, dev/hold-out, and test sets. The rule of thumb is: if you have fewer than about 100k rows, use 60%/20%/20% or 70%/15%/15%; but if you have 1 million rows or more, it could be 98%/1%/1% or even 99.5%/0.4%/0.1%. A sketch of such a three-way split is below.
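A quick sketch of the 60/20/20 case, assuming scikit-learn (illustrative only; two chained train_test_split calls produce the three sets):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, random_state=0)

# First carve off 40% for dev + test, then split that 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 60000 20000 20000
```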
As far as reducing the total rows goes, one trick is to retrain the model on the whole dataset after you have validated the final model, as in the short sketch below.
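Sketched under the same scikit-learn assumption (any learner with a fit method works the same way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# Once the validation estimates above look acceptable, refit the chosen
# configuration on every row so the deployed model wastes no data.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
```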
Harshit
Dortmund, Germany