Pruning a large set of features for text classification
I am working on a binary text classification task. I have applied several preprocessing steps to my training data (stopword removal, stemming, morphological normalization, lowercasing, n-gram creation, etc.) and built TF-IDF vectors. I deleted the rare n-grams (pruning below 5%) and was left with 18,000 n-grams; the choice of cutoff is arbitrary and that bothers me. Then I applied a linear C-SVM (LibSVM). Unfortunately, the accuracy of my model on the test set is very low.

I think I have too many features left and want to reduce their number, so I decided to use information gain to keep only the most informative n-grams. I used the "Weight by Information Gain" operator followed by "Select by Weights" after the "Process Documents" operator, and at the end a cross-validation with the linear SVM inside it. But I got an error saying that the sample does not include the meta data. I am not sure what I am doing wrong or how to improve it.
Besides that, what is the best way to prune a large set of features down to a manageable set of the most discriminative ones, and how can this be implemented in RapidMiner? How else can I improve the performance of my model?
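To make the setup concrete, this is roughly what my current process does, sketched here in scikit-learn purely for illustration (my actual process is built in RapidMiner; `texts`, `labels`, and all parameter values below are placeholders):

```python
# Rough sketch of the current pipeline: TF-IDF on n-grams, rare-term pruning,
# then a linear C-SVM evaluated with cross-validation. Placeholder values only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # word uni- and bigrams; min_df plays the role of my arbitrary pruning cutoff
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    # linear C-SVM, as with LibSVM
    ("svm", LinearSVC(C=1.0)),
])

# texts: list of preprocessed documents, labels: binary class labels
scores = cross_val_score(pipeline, texts, labels, cv=10)
print("mean accuracy:", scores.mean())
```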
Answers
I'm assuming that this post is related to the imbalanced-data thread you started here. I would definitely start by balancing your training data first and then feeding it into your text processing. From the sound of it, your text processing setup is pretty standard. What I would consider is putting both the Text Processing and the Validation with your linear SVM inside an Optimize Parameters operator and varying the C of the SVM and the pruning parameters. That way you can see whether adjusting those parameters, with your balanced data, gets you better performance.
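If it helps, a rough scikit-learn analogue of that Optimize Parameters idea could look like the sketch below (the grid values are only examples, `texts`/`labels` are placeholders, and `class_weight="balanced"` stands in for rebalancing the training data):

```python
# Grid search over the SVM cost C and the pruning threshold, analogous to
# wrapping Text Processing + Validation in an Optimize Parameters operator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC(class_weight="balanced")),  # stand-in for balanced training data
])

param_grid = {
    "tfidf__min_df": [2, 5, 10, 20],    # pruning cutoff to vary
    "svm__C": [0.01, 0.1, 1, 10, 100],  # SVM cost to vary
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```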
Thanks a lot for the suggestions. I will try playing with the parameters first. Which performance measure is appropriate for the comparison in this case; should I work with AUC? And if I still want to use feature reduction based on information gain or chi-square, how could that be implemented for text classification?
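For reference, the kind of feature-reduction step I have in mind, evaluated with AUC, would look roughly like this outside RapidMiner (sketched in scikit-learn, with chi-square standing in for information gain; all names and values are placeholders):

```python
# Chi-square feature selection inside the cross-validated pipeline, scored by AUC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("chi2", SelectKBest(chi2, k=2000)),  # keep the 2000 highest-scoring n-grams
    ("svm", LinearSVC(C=1.0, class_weight="balanced")),
])

# AUC is computed from the SVM's decision function over the cross-validation folds
auc_scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="roc_auc")
print("mean AUC:", auc_scores.mean())
```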