Text Classification
Hi there!
I searched this forum for something that would help but couldn't find anything. Hopefully someone can answer and I will be able to solve the issue.
Let me first describe the task briefly. I have 2 datasets, each containing 2 columns: sentence and label. There are 2 possible labels: true or false. I also have 3 dictionaries of phrases (they can be unigrams, bigrams, 3-grams, ...).
What I want to do:
1) Train an SVM classifier on dataset1 and test it on the same dataset (I did this successfully with cross-validation).
2) Train an SVM classifier on dataset2 and apply the model to dataset1.
3) Use the dictionaries of phrases as features for dataset1.
My questions:
1) As far as I understand, if I want to train a model on one dataset and test it on another, I have to use the same set of features. So I am using the operator "Process Documents from Data" with the same steps inside (tokenizer, stemming, filtering out stopwords, ...), then I take the wordlist of dataset2 and try to feed it into the next "Process Documents from Data" as its wordlist input.
But while running I get this error message:
In WikiTraining I have 10000 sentences, in debates 2000.
But I don't understand the problem. Can someone please explain it and tell me how I can avoid it?
2) How can I use separate CSV files with phrases (let's call them dictionaries) as features in a dataset? Let's say my dictionary contains only triggers, which indicate that a sentence is of class TRUE. How can I do that?
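For anyone reading along, the "same set of features" idea from question 1 can be sketched in plain Python. This is only an illustration of the concept, not what the RapidMiner operators do internally; the toy sentences and the helper names (`tokenize`, `build_wordlist`, `vectorize`) are made up for the example. The wordlist is built from the training data only, and both datasets are then turned into vectors over that one wordlist.

```python
import re

def tokenize(text):
    """Lowercase and split on non-letters (a stand-in for the tokenizer step)."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def build_wordlist(sentences):
    """Collect the sorted vocabulary of the training sentences."""
    return sorted({tok for s in sentences for tok in tokenize(s)})

def vectorize(sentences, wordlist):
    """Count occurrences of each wordlist entry; words not in the wordlist are dropped."""
    index = {word: i for i, word in enumerate(wordlist)}
    rows = []
    for s in sentences:
        row = [0] * len(wordlist)
        for tok in tokenize(s):
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical toy data standing in for dataset2 (training) and dataset1 (test).
train_sentences = ["Some experts disagree", "The plan is clear"]
test_sentences = ["Experts say the plan changed"]

wordlist = build_wordlist(train_sentences)        # built from the training data only
X_train = vectorize(train_sentences, wordlist)
X_test = vectorize(test_sentences, wordlist)      # same columns as X_train
```

Because both matrices share the same columns in the same order, a model trained on `X_train` can be applied to `X_test`. Words that appear only in the test data ("say", "changed") are simply ignored.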
Thank you in advance!
Best Answer
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
OK, let me understand this a bit better. Do you want to train a model on those sentences? So you would have a data set with an attribute column of "in a new direction" or "this is terrible" and the corresponding label "positive" or "negative", respectively, associated with it? If yes, you might want to change the mode parameter on the tokenizer from non-letters to linguistic sentences and try again.
If not, and you want it to be part of a dictionary, you should use the approach that Martin took here: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/ta-p/36067
What you would have to do is put them into a CSV file and delimit using a comma or something.
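As a rough illustration of the CSV idea, here is a small Python sketch that reads a comma-delimited phrase file. The column names (`phrase`, `label`), the sample rows, and the `load_phrases` helper are assumptions made for the example; in practice the text would come from a real file rather than an in-memory string.

```python
import csv
import io

# Hypothetical file content: one phrase per row plus the class it signals.
# With a real file this would be open("triggers.csv", newline="") instead.
csv_text = """phrase,label
some experts,TRUE
in a new direction,TRUE
Some  Experts,TRUE
"""

def load_phrases(handle):
    """Read phrases from a comma-delimited file, normalising case and whitespace."""
    phrases = []
    for row in csv.DictReader(handle):
        phrase = " ".join(row["phrase"].lower().split())
        if phrase and phrase not in phrases:   # skip blanks and duplicates
            phrases.append(phrase)
    return phrases

phrases = load_phrases(io.StringIO(csv_text))
print(phrases)
```

Normalising case and whitespace while loading means that "Some  Experts" and "some experts" end up as a single dictionary entry.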
Answers
Thank you Thomas_Ott!
I didn't even consider that it could cause a problem, but sure! Thank you!
And can anyone give any advice regarding the second question?
I couldn't find anything helpful, only information about using existing dictionaries, and most of the advice is based on installing the extension for a specific dictionary.
Does anyone else have any advice or links? I'm not asking for solutions.
The WordNet extension (free in the Marketplace) has an operator that allows you to use a custom sentiment dictionary in the SentiWordNet format. See that extension for more details.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you for your reply.
One last question.
The WordNet dictionary is basically... a dictionary where 1 observation is 1 word.
What I need is a bit different: I want to see, let's say, "some experts", "in a new direction", "some challenges". So 2 or more words as one observation.
As a result, I want each feature of my SVM classifier to be one of these phrases, like the ones in quotes above.
Do you have any hint/idea on it as well?
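To make the goal concrete, here is a minimal Python sketch of "one phrase = one feature". It is not a RapidMiner process, and the example sentences and helper names (`tokenize`, `count_phrase`, `phrase_features`) are invented for illustration. Each dictionary phrase becomes one column, and the value is how many times the phrase occurs in the sentence.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_phrase(tokens, phrase_tokens):
    """Count how often the phrase occurs as a contiguous token sequence."""
    n = len(phrase_tokens)
    if n == 0:
        return 0
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens
    )

def phrase_features(sentences, phrases):
    """Build one row per sentence and one column per dictionary phrase."""
    phrase_tokens = [tokenize(p) for p in phrases]
    rows = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        rows.append([count_phrase(tokens, pt) for pt in phrase_tokens])
    return rows

# Phrases taken from the post above; the sentences are made up.
phrases = ["some experts", "in a new direction", "some challenges"]
sentences = [
    "Some experts say the field is moving in a new direction.",
    "There are some challenges, and some experts expect more challenges.",
    "Nothing to report.",
]

X = phrase_features(sentences, phrases)
for sentence, row in zip(sentences, X):
    print(row, sentence)
```

Matching on token sequences rather than raw substrings avoids false hits such as "awesome experts" counting as "some experts". The resulting rows could then be fed to any classifier, an SVM included, alongside or instead of the usual word features.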
I haven't tried your proposal yet, but it sounds like what I have been looking for!
Thank you very much!