"Apply model to test set feature selection"

frozenarc · April 2011

Good evening. I have several questions and I hope I can get a good answer here, because I'm a newbie and I need to do this as fast as I can.

I need to do a sentiment analysis to predict "positive" and "negative".

Sorry if this is a double post, but I still don't understand this. I have already watched the video tutorial on rapid-i.com and read the tutorial at http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-5.html but I still don't get what I'm supposed to do.

Here's the thing:
I have 2000 movie reviews: 1000 of them are in a folder named "Positive" (containing the positive reviews) and the other 1000 are in a folder named "Negative" (containing the negative reviews).

I have to extract the features from those 2000 reviews, so I used "Process Documents from Files" as in the vancouverdata.blogspot.com tutorial and created a word vector (TF-IDF). The process then produces a result called "ExampleSet" with 12305 attributes. That means I have 12305 features extracted from the 2000 reviews, right?

From this point, I need to do feature selection. How can I do that? I see there are operators such as Backward Elimination, Forward Selection and so on, but I'm confused about how to use them. I downloaded the "feature selection extension" and used "Recursive Feature Elimination (RFE, SVM-RFE)" (this operator uses a top-k method), but I can't find documentation on what exactly this method does to eliminate features. Can you help me?

After the feature selection, I have to train on the data, so I use a classifier (say, Naive Bayes for this example). When I use a classifier, it means I train on the ExampleSet, right? And where can I find complete documentation on what exactly the Naive Bayes operator does in RapidMiner to train on the data?

Once training is done, the model has been created, right? I want to apply this model to other movie reviews (the test set). I have another 100 movie reviews: 50 of them in a folder called "pos" (containing positive reviews) and the other 50 in a folder called "neg" (containing negative reviews). I want to apply the model so it predicts whether each review is positive or negative. How do I do that?

After that, I need to create a report in Excel format. How can I export the ExampleSet and PerformanceVector to xls automatically? Is it possible?

To summarize what I need:
1) Are the 12305 attributes in the "ExampleSet" 12305 features?
2) How can I do feature selection on those 12305 features using forward selection or another optimization method?
3) THE MOST IMPORTANT: how can I apply the model generated from the training set to my own test set?
4) Where can I find complete documentation on what exactly an operator does in RapidMiner? (The Rapid-I wiki is not what I expected.)
5) How can I export all the results into an Excel report so they are easy to view and can be opened without RapidMiner?

That's all for now. To be honest, I'm an IT student, but I really don't have a background in machine learning, natural language processing, information retrieval, or data mining, so I really need help. I'm a newbie, but I seriously want to learn. Thanks a bunch.

I hope I can get an answer as soon as possible.

IngoRM · May 2011

Hi,

Whoa, a lot of questions. If you need this "as fast as you could", you should consider professional consulting.

1) Are the 12305 attributes in the "ExampleSet" 12305 features?

Yes, the regular ones at least. There might also be a set of special attributes like the label or an id. By the way: reading the manual would help here, and even if you don't have much time, it will save you time in the end.

2) How can I do feature selection on those 12305 features using forward selection or another optimization method?

I would never suggest forward selection on a feature set this big, at least not when the performance is evaluated with an inner learning scheme, a cross validation, etc. This would take way too long. Actually, there have been papers showing that feature selection for text classification is not the best idea anyway (depending on the learning scheme). For exactly that reason, Support Vector Machines work quite well on texts: they can work in high dimensions and they take the redundancy of features into account by weight sharing. I would recommend following that much simpler way, especially for beginners.

3) THE MOST IMPORTANT: how can I apply the model generated from the training set to my own test set?

Come on, there are a lot of samples for that around...

Generate a model (and / or load it from your repository if you have generated it in another process before), load in the test set, preprocess the test data in exactly the same way (use the wordlist which was created during the training and / or apply the same feature selections), and finally use the operator "Apply Model" for creating the predictions.
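Outside RapidMiner, the "use the wordlist which was created during the training" step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what the RapidMiner operators do internally: the tokenizer, the TF-IDF variant, the toy documents, and the function names `fit_wordlist` and `vectorize` are all assumptions made for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def fit_wordlist(train_docs):
    """Build the word list and IDF values from the TRAINING documents only."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))
    wordlist = sorted(doc_freq)
    n_docs = len(train_docs)
    # One simple IDF variant; other tools may use a different formula.
    idf = {w: math.log(n_docs / doc_freq[w]) + 1.0 for w in wordlist}
    return wordlist, idf

def vectorize(doc, wordlist, idf):
    """Map a document onto the fixed word list; words not in it are dropped."""
    counts = Counter(tokenize(doc))
    return [counts[w] * idf[w] for w in wordlist]

# Hypothetical toy data standing in for the review folders.
train_docs = ["great movie great cast", "boring plot bad acting"]
test_docs = ["great acting but terrible pacing"]

wordlist, idf = fit_wordlist(train_docs)
train_vectors = [vectorize(d, wordlist, idf) for d in train_docs]
# The test set reuses the training word list, so both sets share the same columns.
test_vectors = [vectorize(d, wordlist, idf) for d in test_docs]
```

The point of the sketch is that the test documents never contribute new columns: "terrible" and "pacing" do not appear in the training word list, so they are ignored, and a model trained on `train_vectors` sees test vectors of exactly the same shape.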

4) Where can I find complete documentation on what exactly an operator does in RapidMiner? (The Rapid-I wiki is not what I expected.)

What's wrong with the Wiki, which is also used for the context-sensitive documentation inside the program? If you feel that you need more information, you could of course always have a look into the program source code itself. This shouldn't be too hard since you are an IT student.

5) How can I export all the results into an Excel report so they are easy to view and can be opened without RapidMiner?

Depends a bit on what you want to output. If it should be "Report"-like, I would recommend the Reporting Extension or even better, directly go for RapidAnalytics instead. If you are only interested in the predictions or other types of data, you could just use the operator "Write Excel".
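If the predictions end up outside RapidMiner, a rough equivalent of writing them to a spreadsheet can be sketched with Python's standard `csv` module. This writes CSV, which Excel can open, rather than a native xls file, and it is not how "Write Excel" works. The column names, the file names in the sample rows, and the helper `write_predictions` are hypothetical.

```python
import csv
import io

def write_predictions(rows, out):
    """Write one line per example: file name, true label, predicted label."""
    writer = csv.writer(out)
    writer.writerow(["file", "label", "prediction"])
    for row in rows:
        writer.writerow([row["file"], row["label"], row["prediction"]])

# Hypothetical predictions for two test reviews.
predictions = [
    {"file": "pos/review_01.txt", "label": "positive", "prediction": "positive"},
    {"file": "neg/review_01.txt", "label": "negative", "prediction": "positive"},
]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_predictions(predictions, buffer)
csv_text = buffer.getvalue()
```

To produce a real file, the same function can be given a file object opened with `open("predictions.csv", "w", newline="")` instead of the in-memory buffer.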

Hope that helps,
Ingo

frozenarc · May 2011

Whoa, thanks for the answer, I really appreciate it.

I tried what you suggested about applying the model to the test set, but I've got another problem. As I said before, I have 12305 features, and according to your answer, I feed that 12305-word word list into the preprocessing of the test set, right? I put a breakpoint there just to make sure that the test set's example set has the same features as the training set.

I have also already reduced those 12305 features using SVM, so I now have 500 features. The question is: the output of this optimization is an example set. How can I convert this example set into a word list so I can feed those 500 features as a word list into the preprocessing of the test set? Thanks a bunch.

IngoRM · May 2011

Hi,

I feed that 12305-word word list into the preprocessing of the test set, right?

Yip, that's right.

How can I convert this example set into a word list so I can feed those 500 features as a word list into the preprocessing of the test set?

I assume you used something like "Weight by SVM" or something similar to reduce the feature set to 500, right? If yes, you would not need to transform the selected attributes to a word vector but you could simply use the very same weights for selecting the features of the test set as well. You could use the operator "Select by Weights" for this. If you don't have weights, you could create them with the operator "Data to Weights" from your training set first.
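The idea of reusing the same weights on both sets can be illustrated in plain Python. This is a sketch of the concept only, not the behaviour of "Select by Weights" itself: the weight values, the top-k rule based on absolute weight, and the helper names `top_k_features` and `select_columns` are assumptions for this example.

```python
def top_k_features(weights, k):
    """Return the names of the k attributes with the largest absolute weight."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return ranked[:k]

def select_columns(rows, selected):
    """Keep only the selected attributes, in a fixed order, for every example."""
    return [[row.get(name, 0.0) for name in selected] for row in rows]

# Hypothetical attribute weights computed once, on the TRAINING data only.
weights = {"great": 0.9, "boring": -0.8, "movie": 0.05, "plot": 0.1}

# Hypothetical TF-IDF values, one dict per example.
train_rows = [{"great": 1.2, "movie": 0.4, "plot": 0.0, "boring": 0.0}]
test_rows = [{"great": 0.0, "movie": 0.4, "plot": 0.7, "boring": 1.1}]

# The selection is derived from the weights, then applied unchanged to both sets.
selected = top_k_features(weights, k=2)
train_selected = select_columns(train_rows, selected)
test_selected = select_columns(test_rows, selected)
```

Because `selected` is computed once and then applied to both sets, the test set ends up with exactly the attributes the model was trained on, and no conversion back to a word list is needed.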

If you search for those operators here in the forum, you will probably find some processes using them.

Cheers,
Ingo

frozenarc · May 2011

Wow, thanks for your reply. I tried your method, but it didn't work the way I expected either, so I searched this forum and (somehow) found the answer. Thanks anyway for your response, it really helped and gave me an idea for how to solve my problem.

If I run into any problems again, I'll ask for your help again. Thanks a bunch.
