How do I apply predicted test dataset on random unlabeled validation data set
I have trained a model (Decision Tree) on a training set (0.8 of the data), tested it on the test set (the remaining 0.2), and obtained accuracy results with the Performance operator.
However, I also have a random unlabeled validation dataset, i.e. an external example set, and I want to apply the trained and tested model to this validation set and get accuracy and confidence results for it.
How do I proceed? Please suggest as soon as possible.
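(For reference, the workflow described above could be sketched in scikit-learn roughly as follows; the synthetic data and all names here are placeholders for illustration, not the actual RapidMiner process.)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real labeled example set
X, y = make_classification(n_samples=1000, random_state=42)

# 0.8 training / 0.2 test split, as in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Accuracy on the labeled test set (the "Performance" operator step)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```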
Best Answers
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Akshay21,
If your validation set is not labeled, you cannot compute the accuracy (i.e. the proportion of correctly classified examples) of the model on this validation set.
The accuracy of a model can be obtained only if you provide both the "true labels" and the "predicted labels": in your case, you don't have the "true labels".
Regards,
Lionel
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Akshay21,

"Since my unlabeled validation set does not have 'True' labels, we will only get predictions and their confidence ratios. Right?"

With what you call the "unlabeled validation set", you will get the predictions after applying your trained model (trained on your training set and validated on what you call your "test set") to this unlabeled validation set. And yes, you will get a confidence for each class of your label.
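(Continuing the hypothetical sketch from the question above, and reusing its model and make_classification: applying the trained model to an unlabeled set yields predictions and one confidence per class, but no accuracy, since there are no true labels to compare against.)

```python
# Hypothetical unlabeled validation set (the generated labels are discarded)
X_validation, _ = make_classification(n_samples=200, random_state=7)

predictions = model.predict(X_validation)        # predicted labels only
confidences = model.predict_proba(X_validation)  # one confidence per class

# Analogous to RapidMiner's confidence(True) / confidence(False) attributes
for cls, conf in zip(model.classes_, confidences[0]):
    print(f"confidence({cls}) = {conf:.3f}")
```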
"How can I be sure, after testing the model on the test data and the validation set, that we should deploy it?"

We can never be 100% sure that the performance calculated on your test set is representative of the future performance of the production model on unseen data.
The "best practice" (aka as the "gold standard") is to use a "k-folds cross validation" to validate your model but it implies to create k model(s) which may require significant computing time if you have a huge dataset.
A good compromise is to use a multi-hold-out-set validation: in this case you build only one model, so it doesn't require as many resources as cross-validation. FYI, multi-hold-out-set validation is the validation method applied in RapidMiner's Auto Model to calculate the performances of the models. You can look at the documentation about this validation method on the results screen (the final screen) of the Auto Model process.
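(A rough sketch of the multi-hold-out idea, under the assumptions of the earlier snippets; the 60/40 split and the five hold-out subsets are arbitrary choices for illustration, not Auto Model's exact settings.)

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train one model once...
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.4, random_state=42)
single_model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# ...then score it on several disjoint hold-out subsets and aggregate
scores = [accuracy_score(y_part, single_model.predict(X_part))
          for X_part, y_part in zip(np.array_split(X_hold, 5),
                                    np.array_split(y_hold, 5))]
print(f"multi-hold-out accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```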
Two more resources for you:
About cross-validation, a complete article written by @sgenzer:
https://community.rapidminer.com/discussion/55112/cross-validation-and-its-outputs-in-rm-studio
About how to correctly validate a model, a complete article written by Dr. Ingo Mierswa (@IngoRM) (in the attached file).
Hope this helps,
Regards,
Lionel
Answers
One more thing: is there any relation between confidence(True) / confidence(False) on the validation set and accuracy?
Since my unlabeled validation set does not have 'True' labels, we will only get predictions and their confidence ratios. Right?
How can I be sure, after testing the model on the test data and the validation set, that we should deploy it?
Any suggestions would be helpful.