[SOLVED] Apply IDF of training set in test
Hi,
I am trying to use RapidMiner (RM) to solve a document classification problem. I use two separate Process Documents from Files operators: one for the test documents and one for the training documents. The problem is that each operator computes TF-IDF from its own document set. In text classification, the TF-IDF for the test documents should be computed using the IDF from the training documents.
For instance, if we want to classify a single document (using the same structure), its TF-IDF should be based on the term occurrences in that document and the IDF previously computed on the training collection. If the IDF is instead computed from the test document alone, all the features become 0, because every term appears in every document (the only one) of the test collection.
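To make the single-document case concrete, here is a minimal Python sketch outside RapidMiner. It assumes the plain IDF definition log(N / df) and relative term frequency; RapidMiner's exact weighting may differ, and the function names and the tiny document are made up for illustration.

```python
import math
from collections import Counter


def idf_from(docs):
    """IDF per term, using the plain log(N / df) definition (an assumption)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """Relative term frequency times the supplied IDF; unknown terms are skipped."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical one-document test collection.
test_doc = ["red", "blue", "blue"]

# IDF computed from the test collection itself: N = 1 and df = 1 for every term.
own_weights = tfidf(test_doc, idf_from([test_doc]))
print(own_weights)  # every weight is 0.0
```

With N = 1 and df = 1, log(1 / 1) is 0 for every term, so the whole feature vector collapses to zeros regardless of the term frequencies.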
The only option I can think of is to store the IDF of the training terms and multiply it by the TF of the test documents, but that sounds a bit like a hack. Is there an operator or parameter I am missing?
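Roughly, the store-and-multiply idea would look like the following Python sketch. This is not a RapidMiner process; the training documents, the variable names, and the log(N / df) formula are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenised training collection.
train_docs = [
    ["red", "blue"],
    ["red", "green"],
    ["red", "blue", "blue"],
]

# Step 1: compute and store the IDF on the training collection only.
n_train = len(train_docs)
doc_freq = Counter(term for doc in train_docs for term in set(doc))
train_idf = {term: math.log(n_train / df) for term, df in doc_freq.items()}
vocabulary = sorted(train_idf)  # fixed feature order: blue, green, red


def vectorize(doc):
    """TF of the given document multiplied by the stored training IDF."""
    counts = Counter(doc)
    return [(counts[term] / len(doc)) * train_idf[term] for term in vocabulary]


# Step 2: a single test document; "yellow" was never seen in training
# and is dropped because it is not in the stored vocabulary.
test_vector = vectorize(["blue", "blue", "yellow"])
print(dict(zip(vocabulary, test_vector)))
```

Here "blue" keeps a non-zero weight even though the test collection has only one document, because its IDF comes from the training set. A term such as "red" that occurs in every training document would still get weight 0, which is the expected behaviour of this IDF definition.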
Regards,
Answers
You probably want to connect the wor output of the Process Documents operator used for training to the wor input of the Process Documents operator used for testing.
Best,
Marius
In the experiments I am running at the moment, even when the word list is plugged in, the operator seems to use it only as a filter, so the IDF is still computed from the test set. Good point, though.
Thanks a lot for the rapid response,
Why do you apply the feature selection on the test set and not on the training set?
The TF-IDF calculation on the test set considers the word vector of the training set, if you connect the wor outputs. Consider this process, especially the value of blu with wor connected or disconnected:
The example clearly shows that the training IDF is used when the word lists are connected. I tried the same experiment with my data a couple of days ago, but all the features had a value of zero. Clearly the mistake was somewhere else; I should have been more careful.
Thanks for the help