[SOLVED] Applying a pre-trained model on new data
Hello
I have the following concern. If I apply a model to data that has a slightly different set of features than the data the model was trained on, what happens with the values of attributes that are present in the model but not in the test data, and vice versa? Is that a problem for the model to be applied correctly?
This problem occurs in text classification, since the features are words and the feature set becomes a word list. When I extract a word list from a set of training documents and then want to classify a new document, the features of the new document will obviously be different. How should this be handled?
I would expect that applying the old model to new data would yield the same results as if the features had been extracted collectively, since missing values would be assumed to be 0 and those features were not present in the test data anyway. But I have compared these two approaches:
1. Extracting features from the whole data set, splitting the data into training and test sets, learning the classifier, and measuring the accuracy
2. Splitting the data into training and test sets, extracting features from each set independently, learning the classifier, and measuring the accuracy
and I found out that in the second case the classification accuracy is much lower (it dropped from 70% to 20%). Is that something I should have expected, or is my logic wrong here? Is there any way to "fix" the new data to match the old model, or to fix the old model to match the new data? Or is my approach totally wrong?
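The word-list mismatch described above can be illustrated outside RapidMiner; here is a minimal Python/scikit-learn sketch (the library and the toy documents are an illustration added here, not part of the original thread):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the training and test corpora (made up for illustration).
train_docs = ["the cat sat on the mat", "dogs chase cats"]
test_docs = ["a dog sat on the sofa"]

# Extracting a word list from each set independently yields two different feature spaces.
train_vec = CountVectorizer().fit(train_docs)
test_vec = CountVectorizer().fit(test_docs)

print(sorted(train_vec.vocabulary_))  # ['cat', 'cats', 'chase', 'dogs', 'mat', 'on', 'sat', 'the']
print(sorted(test_vec.vocabulary_))   # ['dog', 'on', 'sat', 'sofa', 'the']

# A model trained on the first feature space cannot be applied meaningfully
# to vectors built from the second one.
```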
Regards
Monika
Answers
How missing or additional features are handled depends on the classification algorithm and its implementation in RapidMiner. In general, the behaviour is undefined, though in some cases you may get reasonable results.
However, in your case with text classification you can guarantee that the same features are generated in both training and testing by connecting the wordlist output of the training Process Documents operator to the wordlist input of the Process Documents operator in the testing branch. Have a look at the attached process.
Best,
Marius
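In scikit-learn terms (a translation of the idea added here; the original solution uses RapidMiner's Process Documents operators and their wordlist ports), reusing the training word list corresponds to fitting the vectorizer on the training documents only and merely transforming the test documents with it:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpora and labels, just to make the sketch runnable.
train_docs = ["good movie", "great film", "bad movie", "terrible film"]
train_y = [1, 1, 0, 0]
test_docs = ["great movie", "terrible plot"]

# Build the word list from the training documents only
# (analogous to the wordlist output of the training Process Documents operator).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X_train, train_y)

# Reuse the SAME fitted word list for the test documents (analogous to the wordlist
# input of the Process Documents operator in the testing branch). Words unknown to
# the training word list are dropped; known words missing from a document count as 0.
X_test = vectorizer.transform(test_docs)
print(model.predict(X_test))
```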
(Marius was faster, so I have thrown away most of my message about using the wordlist - Marius is perfectly right here...)
So here are only two comments:

Exactly, and for this reason I would suggest the following as a golden rule: ALWAYS make sure that the attributes used for training and for model application are exactly the same. In the case of text mining, as Marius has pointed out, this can be done by using the word list from the training process also for the text processing of the application / testing data.

Yes. The reason is: you have cheated. If you use both the training AND the test set for the word vector creation, you already put information about the distributions of the test set into the training. This - as happened here - frequently leads to overoptimistic estimations of the prediction accuracy (although related: don't confuse this type of cheating with overfitting).
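This "cheating" can also be reproduced outside RapidMiner. The following scikit-learn sketch (an illustration added here, not from the original thread) contrasts a leaky evaluation, where the word list is built from all documents before splitting, with a fair one, where the word list is rebuilt inside each training fold; on realistic data the fair estimate is typically lower:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Made-up toy corpus, only to make the two setups concrete.
docs = ["good movie", "great film", "bad movie", "terrible film",
        "great plot", "awful acting", "wonderful cast", "boring story"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

# "Cheating" setup: the word list is built from ALL documents first, so information
# about the held-out folds leaks into the preprocessing.
X_all = CountVectorizer().fit_transform(docs)
leaky_scores = cross_val_score(MultinomialNB(), X_all, labels, cv=4)

# Fair setup: wrapping the vectorizer in a pipeline refits it on each training fold
# only, so the held-out fold never influences the word list.
fair_scores = cross_val_score(make_pipeline(CountVectorizer(), MultinomialNB()),
                              docs, labels, cv=4)

print(leaky_scores.mean(), fair_scores.mean())
```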
The strength of RapidMiner is that preprocessing is never done automatically during learning, so you can actually control the preprocessing and see its impact on the prediction accuracy. The downside is that those correct estimations delivered by RapidMiner (if the process setup is done correctly) are almost always worse than the cheated ones delivered by many other solutions. You can see this not only for text preprocessing but also for parameter optimization, attribute selection, attribute weighting, attribute construction...
I strongly believe that this fair and true evaluation is important not only in science but also for real-world applications. I don't like bad surprises, and I also want to know whether I can truly stop optimizing because the results are good enough (instead of just having found a more complex and therefore unspotted way of cheating...).
Just my 2c,
Ingo
However, I would just like to make sure I got it correctly. I was wondering whether I am now allowed to use Marius' solution, or whether it is still cheating. I think it should be OK: even though I use information from the training data during test data feature extraction, the model has been trained without knowing anything about the test data. Am I right?
Yes, Marius' solution is perfect. This is of course no problem - you always use information about the training data (in most cases: the generated model) for model application. The other way round, putting test information into training, is the problem.
Cheers,
Ingo