What model should I use ( training, validation or testing )
cliftonarms
Member Posts: 32 Contributor II
I am seeking some best-practice advice on applying a model for live prediction, as I am a little confused about which approach is normally adopted.
The data: My data set has 50 attributes and 3400 rows (90% for training, 10% for unseen testing), with the very last row reserved as the live prediction example.
The training: I use the 90% training data in 10-fold cross-validation to find the best learning algorithm and attribute mix for my data. I then confirm the chosen setup by applying the resulting model to the 10% of unseen data.
My question is: once I am happy with the above results, which model do I use (or create) for the live prediction of the last row?
1) Do I use the best model created via 10-fold cross-validation on the 90% of data?
2) Do I create a model from the 90% training data (without cross-validation) using the best settings found during cross-validation?
3) Do I create a model on 100% of the data (90% training plus 10% unseen) with the best settings found during training?
Thank you in advance for your time.
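For reference, the workflow described above can be sketched in scikit-learn (the original process is built in RapidMiner, so the learners and parameters below are purely illustrative stand-ins):

```python
# Minimal sketch of the described workflow: 90/10 hold-out split,
# 10-fold cross-validation on the 90% to compare candidate setups,
# then one confirmation run on the unseen 10%.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in for the 3400-row, 50-attribute data set
X, y = make_classification(n_samples=3400, n_features=50, random_state=0)

# 90% training, 10% unseen test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

# 10-fold cross-validation on the training data to compare learners
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=10)
    print(name, scores.mean())

# Confirm the chosen setup once on the unseen 10%
best = candidates["forest"].fit(X_train, y_train)
print("hold-out accuracy:", best.score(X_test, y_test))
```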
Answers
With a large dataset you could go with (2): select based on a training/test split, and you can do without cross-validation here.
Whatever you pick, never do (3), as you run the risk of badly over-fitting the data.
Some authors recommend splitting the dataset into training/test/validation sets: train your models on the training set, compare them on the test set, pick the best, and then estimate the error rate of the best model on the validation set.
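The three-way split just described can be sketched as follows (again in scikit-learn for illustration; the 60/20/20 proportions are an assumption, not from this thread):

```python
# Three-way split: train on one part, compare models on a second,
# and estimate the winner's error on a third part that was never
# used for any selection decision.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3400, n_features=50, random_state=1)

# 60% train, 20% model-comparison ("test"), 20% final validation
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=1)
X_sel, X_val, y_sel, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1)

models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=1)]
fitted = [m.fit(X_train, y_train) for m in models]

# Pick the winner on the comparison set...
best = max(fitted, key=lambda m: m.score(X_sel, y_sel))

# ...and report its error estimate on the held-back validation set
print("estimated accuracy:", best.score(X_val, y_val))
```

Because the validation set played no part in choosing `best`, its score is an honest estimate of live performance.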
Can I just confirm: you are advocating using the "best" model created by the 10-fold cross-validation method, and not retraining a model with the "best" settings on the complete data set?
I have a question. Do you apply the trained model with the model applier right after the cross-validation, or do you have to train again over the whole training set after the cross-validation has run? I am asking because, when you do a feature selection with an inner cross-validation, you don't get a model out of the feature selection (there is no output port for it). You could save the model with a Remember operator inside the feature selection, recall it outside the operator, and combine it with the feature weights for the unseen test set. But I think one has to retrain over the full training set with the selected features, right?
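One common way to sidestep the Remember workaround is to make feature selection part of the model itself, so that refitting on the full training set re-runs the selection automatically. A hedged scikit-learn sketch (the selector, learner, and `k=20` below are illustrative assumptions):

```python
# Wrap feature selection and the learner in one Pipeline, cross-validate
# the whole pipeline, then refit it once on the full training set. The
# refit pipeline is the deployable model for unseen data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=3400, n_features=50, random_state=2)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # k is illustrative
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validation estimates the performance of the whole procedure,
# selection included, so there is no leakage from the held-out folds
scores = cross_val_score(pipe, X, y, cv=10)

# Retrain over the full training set: the selection step is re-run
# here, which matches the "retrain with the selected features" idea
final_model = pipe.fit(X, y)
```

Cross-validation then estimates how well the *procedure* generalizes, while `final_model` is the single artifact you apply to new rows.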