Getting reliable results. Which model to choose?
cliftonarms
Advice kindly sought from any "seasoned" data predictors / miners out there.
I have created an experiment within Rapidminer to iterate through different inputs and modelling configurations, attempting to find the "best prediction fit" for my data.
The data consists of 3100 rows of learning data and 300 rows of unseen testing data.
Each dot on the graph below represents an individual model plotted at its learning performance vs. testing performance (the scale is not relevant).
My question is : which model should I choose to produce the most reliable and robust prediction of new "unseen" data?
- Choose a model from the ORANGE area where the training performance was very good, but the testing performance was poor.
- Choose a model from the BLUE area where both the training performance and the testing performance were good.
- Choose a model from the GREEN area where the training performance was poor, but the testing performance was very good.
Answers
Learning performance is the prediction profit averaged over ALL 3100 rows of training data.
Testing performance is the prediction profit averaged over ALL 300 rows of testing ( unseen ) data.
The higher the profit the better the performance of the prediction system.
My problem is that the trained models do not perform with the same prediction rate on the unseen data (obviously), so it's how to choose the best model to go live with.
It seems that you have a label for your test set (which you need to measure the performance). My suggestion would be to join the training and test data and apply a parameter optimization. Within this optimization operator you should use cross-validation (X-Validation) to calculate the performance for a specific parameter set. Basically, a cross-validation will partition the input into k subsets. The model is learned on k-1 subsets and tested on the k-th subset. This is repeated until every subset has been used as the test set exactly once. In the end an average performance is returned. This gives a more reliable performance measure for selecting the best parameters for a learner.
After you have found the best parameter set, use this to learn a model on the whole data set and use this model to predict completely unseen data.
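In Python/scikit-learn terms (just a sketch of the same idea; the synthetic data and the parameter grid below are placeholders, not the actual RapidMiner process), that workflow looks roughly like this:

```python
# Minimal sketch of "join the data, optimize parameters with cross-validation,
# then refit on everything". Synthetic data and grid values are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for the joined 3100 training rows + 300 test rows (both labelled).
X, y = make_classification(n_samples=3400, n_features=20, random_state=0)

# Cross-validation inside the parameter search gives an averaged, more
# reliable performance estimate for each candidate parameter set.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
    cv=10,
)
search.fit(X, y)

# With refit=True (the default), the best parameter set is re-trained on the
# whole data set; this is the model to apply to completely unseen data later.
final_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```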
It's how to choose which model to go with - should I just go with the best unseen-data performance? i.e. the models in the green circle on the graph.
With your example (and I understand it is just an example) you have no control over the model generated and its performance on unseen data. Sort of "open loop" optimisation.
I automatically vary the attribute selection via weighting (17 methods), the number of attributes (5-57), the classification kernel (SVM) and the model parameters (C / gamma / nu). Each run goes through an x-validation on the 3100 learning rows, so all the legwork is done automatically.
So I am varying everything to give me a large set of possible models.
I then apply these models to the unseen data to check the performance of each model.
I don't think it's the model creation I have a problem with. It's after the model is validated: do I just pick the model that performs best on "unseen data"? Is it that simple? e.g. the model with a score of 16.5 for testing performance (unseen) in the green circle on the graph above.
Best regards,
Wessel
The only reason it is not a % figure is that % prediction correctness is not a useful measure of performance with this system. So the performance number is calculated from the prediction results after the data is applied to the model.
Out of the cross-validation operator comes a model, yes.
This is the model trained on the entire training data, yes.
But this is NOT cross-validation performance.
You should make the same figure where you use "cross-validation performance" on 1 axis.
Best regards,
Wessel
The "training performance " is the 3100 rows of training data applied directly to the model generated by the result of the x-validation process.
The actual average x-validation performance is not captured as it does not represent a reliable performance measure in this scenario.
If this is the case, you might as well have not done any x-validation.
Just generate the figure?
Then you have full-training set performance, x-validation performance, and hold-out set performance.
I would like to see the x-validation performance vs. hold-out set performance figure.
As far as I'm aware this is 5 minutes' work, right?
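For what it's worth, a rough sketch of collecting those three numbers for a single candidate model in Python/scikit-learn (the synthetic data and parameter values are placeholders, not the actual setup):

```python
# Sketch: for one candidate model, compute full-training-set performance,
# cross-validation performance, and hold-out performance side by side.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 3100 learning rows and 300 unseen rows.
X, y = make_classification(n_samples=3400, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=300, random_state=0)

model = SVC(C=10, gamma=0.01)  # one candidate parameter set (placeholder values)

# Cross-validation performance: average score over the k held-out folds.
xval_perf = cross_val_score(model, X_train, y_train, cv=10).mean()

# Full-training-set performance: fit on all learning rows, score on the same rows.
model.fit(X_train, y_train)
train_perf = model.score(X_train, y_train)

# Hold-out performance: score on the rows the model never saw during fitting.
holdout_perf = model.score(X_hold, y_hold)

print(train_perf, xval_perf, holdout_perf)
```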
The problem I have is that, even though my problem space is fundamentally a binomial classification task, each individual prediction carries a different cost.
For example (although this is not my problem): blood pressure classification = blood pressure too high or blood pressure too low.
However, misclassifying someone whose blood pressure is SLIGHTLY too high/low is far less serious than misclassifying someone whose blood pressure is VERY high/low. Hence the % validation performance is useless; I add a unique cost (the actual blood pressure variance around normal) to each classification prediction, and average over all predicted examples to find the total system performance.
This statement is untrue.
First of all, it is written cross-validation or x-validation, not % validation.
Secondly, x-validation is a sampling process; it has nothing to do with classification cost.
You should simply change your "measure" of performance to reflect this cost.
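A minimal sketch of such a cost-reflecting measure in Python (the scoring rule below, scoring each example by plus or minus its severity depending on whether it was classified correctly, is an illustrative assumption, not the poster's exact formula):

```python
# Sketch of a cost-weighted performance measure: each example contributes
# +severity if classified correctly and -severity if misclassified, and the
# overall performance is the average over all predicted examples.
import numpy as np

def cost_weighted_performance(y_true, y_pred, severity):
    """Average per-example score, weighted by each example's severity
    (e.g. how far the blood pressure deviates from normal)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    severity = np.asarray(severity, dtype=float)

    scores = np.where(y_true == y_pred, severity, -severity)
    return scores.mean()

# Example: misclassifying a severe case (severity 4.0) hurts far more
# than misclassifying a mild one would.
print(cost_weighted_performance(
    y_true=[1, 0, 1, 0],
    y_pred=[1, 0, 0, 0],
    severity=[5.0, 0.5, 4.0, 0.3],
))
```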
A trick that is sometimes used to reflect different costs while using "standard accuracy" as a performance measure is copying instances with high cost.
This can help your learner pick up on the correct patterns.
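A rough Python sketch of that instance-copying trick (the column names and the cost-to-copies scaling rule are illustrative assumptions):

```python
# Duplicate each example roughly in proportion to its cost, so that a plain
# accuracy-driven learner weights costly mistakes more heavily.
import numpy as np
import pandas as pd

def oversample_by_cost(df, cost_column="cost", scale=1.0):
    """Repeat each row round(scale * cost) times (at least once)."""
    repeats = np.maximum(1, np.round(scale * df[cost_column]).astype(int))
    return df.loc[df.index.repeat(repeats)].reset_index(drop=True)

# Example: the high-cost case (cost 5) ends up represented 5 times as often.
data = pd.DataFrame({"feature": [1.2, 3.4, 0.7], "label": [1, 0, 1], "cost": [5, 1, 2]})
print(oversample_by_cost(data))
```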
Also, check out the "Performance (Costs)" operator!
Best regards,
Wessel
The models in the orange area are guaranteed to be bad at generalizing and therefore any prediction can be expected to be bad.
The models in the green area are the more challenging ones. I think they are a result of how the experiment is being conducted: having models with really poor performance on the "learning" set and very good performance on the "testing" set looks like a coincidence of randomly chosen parameters happening to fit that particular testing set, not a result of having a good model. I think it is fair to say that these models are not approximating the overall error surface well, so even though they approximate the testing cases well, you can't rely on them.
Hope this point of view helps.