Apply Model: Testing & Training Sets Differ
Hi
I am using Sentiment 140 as my training and testing data; it already comes split into two sets. I am performing training, cross-validation and testing separately: training and CV on the training set, and testing on the test set. The problem is that after text preprocessing, the features in the test set don't align with those of the training set, so I can't apply the trained model. The end product of my text preprocessing is a matrix in which the texts are the examples and the features are term frequencies, and the terms differ between the training and test sets.
Do I somehow merge both sets so that the features are aligned, with TF = 0 for terms that are missing from a set?
Thanks
Best Answers
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,635; Unicorn)
The word list elements will be constrained, but the TF-IDF values will be recalculated on the new sample in Process Documents.
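The idea in this answer can be illustrated outside RapidMiner with a minimal Python sketch. It is not the Process Documents operator itself: the function names (`build_word_list`, `tfidf_matrix`), the whitespace tokenizer, the smoothed IDF formula and the toy documents are all assumptions made for illustration. The word list is built from the training texts only and then fixes the columns for both sets, while the IDF part is recomputed on whichever sample is being processed.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplistic lowercase whitespace tokenizer (assumption; real preprocessing does more).
    return text.lower().split()

def build_word_list(docs):
    # Sorted vocabulary taken from the training documents only.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def tfidf_matrix(docs, word_list):
    # One row per document, one column per word-list term, in word-list order.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequencies come from THIS sample, so IDF is recalculated per sample.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        row = []
        for term in word_list:
            tf = counts[term] / total
            idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed IDF (assumed variant)
            row.append(tf * idf)
        rows.append(row)
    return rows

# Toy data, invented for illustration.
train_docs = ["good happy day", "bad sad day"]
test_docs = ["happy happy morning", "sad night"]

word_list = build_word_list(train_docs)        # columns are fixed here
train_X = tfidf_matrix(train_docs, word_list)
test_X = tfidf_matrix(test_docs, word_list)    # same columns, values from the test sample
```

Terms that occur only in the test texts ("morning", "night") are dropped, and training terms that a test text lacks get a value of 0, so both matrices have identical columns without merging the two sets.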
jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
Be careful here: if your text processing in training uses pruning, make sure that in testing you not only use your saved word list to constrain the terms used in the TF-IDF vector, as suggested by @Telcontar120, but also switch off pruning. Otherwise your word list may be shrunk by the pruning step, leaving the two sets incompatible when you apply the model to the test data.
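A small Python sketch shows why pruning on the test side is a problem. It only imitates the effect described above and is not RapidMiner's pruning implementation; the `prune` helper, the `min_docs` threshold and the toy documents are assumptions.

```python
from collections import Counter

def tokenize(text):
    # Simplistic lowercase whitespace tokenizer (assumption).
    return text.lower().split()

def document_frequency(docs):
    # Number of documents in which each term appears.
    return Counter(tok for d in docs for tok in set(tokenize(d)))

def prune(word_list, docs, min_docs):
    # Keep only terms that appear in at least `min_docs` documents of `docs`.
    df = document_frequency(docs)
    return [t for t in word_list if df[t] >= min_docs]

# Toy data, invented for illustration.
train_docs = ["good happy day", "bad sad day", "good day", "sad bad night"]
test_docs = ["happy day", "sad day"]

# Training: prune terms that occur in fewer than 2 training documents.
train_vocab = sorted(document_frequency(train_docs))
train_word_list = prune(train_vocab, train_docs, min_docs=2)

# Testing with pruning left on: the saved word list shrinks again.
test_pruned = prune(train_word_list, test_docs, min_docs=2)

# Testing with pruning off: the saved word list is used unchanged.
test_unpruned = list(train_word_list)
```

With pruning left on, the test branch keeps only "day", so its columns no longer match the four columns the model was trained on. With pruning off, the word list passes through intact.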
jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
I have just noticed that you reduce dimensionality by selecting attributes by weight. In that case, pass the weights from training to your testing branch. There you do not need the weighting operator; you simply select attributes using the weights computed in training.
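The same pattern can be sketched in Python: compute the weights once on the training matrix, then reuse them unchanged to pick columns in the test matrix. The weighting scheme below (absolute difference of class means) is only a stand-in, and the helper names, the value of `k` and the toy matrices are assumptions rather than what any RapidMiner operator does internally.

```python
def class_mean_difference_weights(rows, labels, columns):
    # Stand-in weighting: |mean for class 1 - mean for class 0| per column.
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    weights = {}
    for j, name in enumerate(columns):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        weights[name] = abs(mean_pos - mean_neg)
    return weights

def select_by_weights(rows, columns, weights, k):
    # Keep the k highest-weighted columns, preserving the original column order.
    top = set(sorted(columns, key=lambda c: weights[c], reverse=True)[:k])
    keep = [j for j, c in enumerate(columns) if c in top]
    return [[r[j] for j in keep] for r in rows], [columns[j] for j in keep]

# Toy term matrices that already share the training word list as columns.
columns = ["bad", "day", "good", "sad"]
train_rows = [[0.0, 0.5, 0.5, 0.0], [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.3, 0.7, 0.0], [0.4, 0.0, 0.0, 0.6]]
train_labels = [1, 0, 1, 0]
test_rows = [[0.0, 0.5, 0.5, 0.0], [0.5, 0.0, 0.0, 0.5]]

# Weights are computed on the training data only.
weights = class_mean_difference_weights(train_rows, train_labels, columns)

# The same weights drive the selection in both branches.
train_sel, kept = select_by_weights(train_rows, columns, weights, k=2)
test_sel, kept_test = select_by_weights(test_rows, columns, weights, k=2)
```

Because the test branch never recomputes weights, it keeps exactly the columns the model saw in training ("bad" and "good" in this toy case).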
Answers
This works, using the word list output of the training leg. But what if I process that information further after the Process Documents operator and reduce the features with a Select by Weights operator?