Apply Model: Testing & Training Sets Differ

Hyram · July 2020

Hi
I am using Sentiment140 as my training and testing data. It already comes split into two sets. I am performing training, cross validation and testing separately: training and CV on the training set, and testing on the testing set. The problem is that after text preprocessing, the features in the test set don't align with those of the training set, so I can't apply the trained model. The end product of my text preprocessing is a matrix in which each text is an example and each feature is a term holding its term frequency, and the set of terms differs between the training and test sets.
Should I somehow merge both sets so that the features are aligned, with TF = 0 for terms that don't occur in a set?
Thanks

Telcontar120 · July 2020

If you pass the training word list into Process Documents on the test data, the word list elements will be constrained to that list, but the TF-IDF values will be recalculated on the new sample.
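
For readers outside RapidMiner, the same idea can be sketched in plain Python. This is only an illustration under assumptions: the helper names `build_word_list` and `tfidf_matrix` are hypothetical, tokenisation is a simple whitespace split, and the TF-IDF formula is a common textbook variant rather than the one Process Documents uses. The point is that the columns come from the training word list, while the values are computed on whichever sample is being vectorised.

```python
import math
from collections import Counter

def build_word_list(docs):
    """Collect the sorted vocabulary of the training documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def tfidf_matrix(docs, word_list):
    """Vectorise docs against a fixed word list; IDF is computed on these docs only."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency of each word-list term within this sample.
    df = {w: sum(1 for toks in tokenised if w in toks) for w in word_list}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = []
        for w in word_list:
            tf = counts[w] / len(toks) if toks else 0.0
            # Terms absent from this sample get 0 (assumed convention).
            idf = math.log(n_docs / df[w]) + 1.0 if df[w] else 0.0
            row.append(tf * idf)
        rows.append(row)
    return rows

train_docs = ["good movie", "bad movie", "good good fun"]
test_docs = ["good plot", "terrible movie"]

# The word list is built once, on the training documents.
word_list = build_word_list(train_docs)
train_matrix = tfidf_matrix(train_docs, word_list)
# Same columns for the test set; "plot" and "terrible" are dropped,
# and training terms missing from the test documents become 0.
test_matrix = tfidf_matrix(test_docs, word_list)
```

Both matrices have one column per training term, so a model trained on `train_matrix` sees the same attributes when applied to `test_matrix`.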

jacobcybulski · July 2020

Be careful here: if your text processing in training uses pruning, then in testing you must not only use your saved word list to constrain the terms in the TF-IDF vector, as suggested by @Telcontar120, but also switch off pruning. Otherwise the word list may be shrunk by the pruning step, making the two sets incompatible when you apply the model to the test data.
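
A small Python sketch of why pruning on the test leg causes trouble. The `prune` function and its `min_docs` threshold are hypothetical stand-ins for a document-frequency pruning rule, not RapidMiner's implementation.

```python
def prune(word_list, docs, min_docs=2):
    """Keep only terms that occur in at least min_docs documents (assumed rule)."""
    tokenised = [set(doc.lower().split()) for doc in docs]
    return [w for w in word_list
            if sum(w in toks for toks in tokenised) >= min_docs]

train_docs = ["good movie", "bad movie", "good fun", "bad fun"]
test_docs = ["good movie", "good plot"]

# Training leg: build the vocabulary, then prune it on the training documents.
vocab = sorted({tok for doc in train_docs for tok in doc.lower().split()})
train_word_list = prune(vocab, train_docs)

# Test leg, wrong: pruning again on the test documents shrinks the list,
# so the test matrix would have fewer columns than the model expects.
repruned = prune(train_word_list, test_docs)

# Test leg, right: reuse the saved training word list with pruning off.
test_word_list = list(train_word_list)
```

Here the training list keeps four terms, while re-pruning on the two test documents leaves only `good`, which is exactly the mismatch described above.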

jacobcybulski · July 2020

I have now noticed that you reduce dimensionality with a weight-based selection method. In that case, pass the weights from training to your testing branch. There you do not need the weighting operator; you only apply the selection, using the weights from training.
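
As a rough Python sketch of that flow: weights are computed once on the training data and the same weights drive the selection on both legs. The weighting scheme (absolute difference of class means) and the names `feature_weights` and `select_by_weights` are assumptions for illustration only.

```python
def feature_weights(rows, labels, names):
    """Hypothetical weighting: absolute difference of class means per feature."""
    weights = {}
    for j, name in enumerate(names):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        weights[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return weights

def select_by_weights(rows, names, weights, top_k):
    """Keep the top_k features ranked by the supplied weights."""
    keep = sorted(names, key=lambda n: weights[n], reverse=True)[:top_k]
    idx = [names.index(n) for n in keep]
    return [[r[j] for j in idx] for r in rows], keep

names = ["bad", "fun", "good", "movie"]
train_rows = [[0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 2, 0], [1, 1, 0, 0]]
train_labels = [1, 0, 1, 0]
test_rows = [[0, 0, 1, 1], [1, 0, 0, 0]]

# Weights are computed on the training data only.
weights = feature_weights(train_rows, train_labels, names)

# The same weights select the columns on both legs; no weighting on the test leg.
train_sel, train_cols = select_by_weights(train_rows, names, weights, top_k=2)
test_sel, test_cols = select_by_weights(test_rows, names, weights, top_k=2)
```

Because the test leg never recomputes weights, both reduced matrices end up with the same columns in the same order.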

Hyram · July 2020

Apologies - I see this was solved by Marius and Ingo in 2012. I was wondering whether, if you connect the word list output of Process Documents on the training leg to the word list input of Process Documents on the test leg, the output of Process Documents on the test leg uses the same TF values or zeros. The values carried through are indeed zero.
Using the word list output of the training leg works, but what if I process the data further after the Process Documents operator and reduce the features with a select by weight operator?

Hyram · July 2020

Thanks very much @Telcontar120 and @jacobcybulski!
