"Sentiment Analysis - Numerical Labels, and the search for the right Process"

andk · March 2011

I have another question, which might be easy to answer for those of you who have already played around with the sentiment analysis capabilities of RapidMiner. On the one hand, I have a collection of thousands of documents from which I extracted the information I need and compiled a matrix with the corresponding TF-IDF scores of the expressions appearing in the documents. On the other hand, I have a matrix of words in which each word is assigned a sentiment score between 0 and 1. The question is how to bring these two together to measure the sentiment reflected in the documents over time. The idea is to match the TF-IDF matrix with the word/sentiment score matrix. More precisely, I want to see which expressions of the sentiment matrix also appear in the respective documents and weight them with the respective IDF values. Is there a process which does this? I tried to follow the example described here http://rapid-i.com/rapidforum/index.php/topic,2993.0.html and the classification approach presented in the Vancouver Data Blog Video Tutorial 5, but the problem seems to hinge on the fact that the learning processes don't accept numerical labels. Could somebody give me a hint? I would really appreciate that!
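To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of the weighting described above. It is only one plausible reading of the question: the toy documents, weights and lexicon entries are made up for illustration, and the function name `weighted_sentiment` is hypothetical. Each document score is the weighted mean of the sentiment scores of the words that appear both in the document and in the lexicon.

```python
# Hypothetical toy data: document names, weights and scores are illustrative only.
tfidf = {
    "doc1": {"able": 0.5, "house": 0.25, "cow": 0.25},
    "doc2": {"cow": 1.0},
}
# Word -> sentiment score between 0 and 1, as described in the post.
sentiment = {"able": 0.7, "cat": 0.0, "competent": 0.6, "house": 0.1}


def weighted_sentiment(doc_weights, lexicon):
    """Weighted mean of lexicon scores over words found in both inputs.

    Returns None when the document shares no word with the lexicon.
    """
    matched = {word: weight for word, weight in doc_weights.items() if word in lexicon}
    total_weight = sum(matched.values())
    if total_weight == 0:
        return None
    return sum(weight * lexicon[word] for word, weight in matched.items()) / total_weight


scores = {doc: weighted_sentiment(weights, sentiment) for doc, weights in tfidf.items()}
for doc, score in scores.items():
    print(doc, score)
```

Normalizing by the total matched weight is a design choice made here, not something stated in the post; a plain weighted sum would be equally easy to compute.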

Best regards,

André

land · March 2011

Hi Andre,
this is a very unusual approach. Normally you want to avoid building this sentiment/word matrix yourself and let the program do it! You normally assign a certain sentiment to each of your documents and then apply a learning scheme to derive the effects.
If you have manually assigned these factors, you have done the data mining by hand and derived some sort of linear model. What you have to do is put them into a model so that you are able to apply them. There's no suggested way to do this because, well, as I said: nobody normally wants to do this.
The only thing I can imagine is exporting a linear regression model as XML, manually editing the file, and re-importing it...

Greetings,
Sebastian

andk · March 2011

Sebastian, thanks for your reply. I was offline for the last two days. I think there is a misunderstanding, or I explained my problem in too complicated a way. I have a word vector with a sentiment value attributed to each word. It comes from SentiWordNet; I just calculated a measure useful for my purpose from the given values. Additionally, I have a word list and an IDF matrix, obtained by normal text processing from a pretty huge number of documents. My idea now is to create a word list from each of the two processes and match them against each other. That is, I want to see which of the expressions for which I have a sentiment value appear in the word list extracted from the documents. I tried to do this with the Cross Distances operator. But the word list results from the document processing, and in order to select the right attribute I have to transform the word list with the WordList to Data operator. It turns out that the WordList to Data operator converts the expressions in my word list to polynominal form, and it seems that the Cross Distances operator can't handle this. Which parameter of the Cross Distances operator would be the right one to match nominal expressions?
Guys, I hope I am not annoying you too much. As far as possible, I will also contribute on the helping side in this forum!

best regards, andre

land · April 2011

Hi,
I think you explained what you are doing in an understandable way, but I don't understand WHY you would do this. What would the result mean?

Greetings,
Sebastian

andk · April 2011

It should simply let me estimate the sentiment of an article, since I already have sentiments for several of its tokens. So I actually have two word lists, one from my articles and one with attributed sentiments, and I have to link the two; in other words, I have to find out which of my sentiment tokens appear in which article. Unfortunately, I lack the technical skills to use the Cross Distances operator correctly, although I think it should be the right operator for me. Anyway, thank you, Sebastian, for your effort! If you come across this topic again and have an idea, it would be very helpful if you shared it with me.

best regards, andre

andk · April 2011

Is there really nobody who could help me? To clarify a bit more what I want to do, and to make it more attractive ^^ to help me, I have created tables to show what I would like to do.

Sentiment word list (created from a CSV file) (Tab1)

ID	Word	Sentimentscore
1	able	0.7
2	cat	0
3	competent	0.6
4	corrupt	-0.6
5	house	0.1
6	...	...

The word list obtained by processing the documents (Tab2)

ID	Word
1	able
2	cow
3	house
4	competent
5	computer
6	...

Now I want to see if and where there are matches between the word columns of Tab1 and Tab2. The best thing would be to have a vector of distance or similarity measures for all combinations of words from Tab1 and Tab2. The meta information, Sentimentscore, should not be lost in this process. Is there something which could help me with this task? The result could look like this:

Tab1	Tab2	Distance	Sentimentscore
able	able	0	0.7
able	cow	1	0.7
able	house	1	0.7
...	...	...	...
competent	competent	0	0.6
...	...	...	...

I want to underline that this is just for academic purposes and out of personal interest. I am exploring what I could do in my thesis and playing around a little bit with RM. I am looking forward to your comments!

Best regards,

André

IngoRM · April 2011

Hi,

are the words in Tab2 unique (I guess they are at least in Tab1)? If yes, a simple "Join" would be sufficient with the word columns as IDs if you are interested in "full match" (distance 0) vs. "no match" (distance 1) only.

Otherwise a more complex process has to be created which would definitely also be possible.
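As a rough illustration of what the simple join amounts to, here is a Python sketch using the example words from Tab1 and Tab2. It assumes the words in Tab1 are unique so they can serve as keys; the function name `inner_join` is hypothetical and this is not the RapidMiner operator itself.

```python
# Tab1 as a lookup table (word -> sentiment score); the words act as the join key.
tab1 = {"able": 0.7, "cat": 0.0, "competent": 0.6, "corrupt": -0.6, "house": 0.1}
# Tab2: words extracted from the documents.
tab2 = ["able", "cow", "house", "competent", "computer"]


def inner_join(doc_words, lexicon):
    """Keep only the words present in both tables, paired with their sentiment score."""
    return [(word, lexicon[word]) for word in doc_words if word in lexicon]


joined = inner_join(tab2, tab1)
for word, score in joined:
    print(word, score)
```

Words that appear in the result are the "full match" cases; everything else in Tab2 ("cow", "computer") is a "no match".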

Cheers,
Ingo

andk · April 2011

Ingo, thanks a lot! Ahhhh

OK, this is an approach which I will test as soon as I am on my Windows RM machine again. What would such a more complex process with distances look like? I don't need details, just a hint or a sketch of which operators might work and how the roles of the word attribute would have to be set! Thanks for your help!
Merci!

andré

IngoRM · April 2011

Hi again,

actually, even if the words in Tab2 are not unique, the join approach should work pretty well. You will end up (depending on using a left or a right join) with a data set Tab2 with an additional column containing the corresponding sentiment scores from Tab1. A simple aggregation (average or sum) will then deliver the final, aggregated score for the document encoded in Tab2.
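The join-then-aggregate idea can be sketched in Python as follows. The token list is a made-up document with a repeated word, and the function names `left_join` and `aggregate` are hypothetical. Unmatched tokens get a missing score and are skipped during aggregation, which is an assumption of this sketch rather than something stated in the thread.

```python
from statistics import mean

# Tab1 as a lookup table (word -> sentiment score).
lexicon = {"able": 0.7, "cat": 0.0, "competent": 0.6, "corrupt": -0.6, "house": 0.1}
# Hypothetical document tokens; "able" occurs twice, "cow" is not in the lexicon.
doc_tokens = ["able", "house", "able", "cow", "corrupt"]


def left_join(tokens, lexicon):
    """Keep every token and attach its score, or None if the lexicon has no entry."""
    return [(token, lexicon.get(token)) for token in tokens]


def aggregate(joined):
    """Return (sum, average) of the available scores, or (None, None) if there are none."""
    scores = [score for _, score in joined if score is not None]
    if not scores:
        return None, None
    return sum(scores), mean(scores)


joined = left_join(doc_tokens, lexicon)
total, average = aggregate(joined)
print(joined)
print(total, average)
```

Because repeated tokens stay in the joined table, a word that occurs more often contributes more to the sum, which is why non-unique words in Tab2 are not a problem.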

Well, if you want to calculate text-based similarities, I would have a look at the Text Extension of RapidMiner and use the preprocessing operators it provides. You could, for example, transform the words into their stems, use character n-grams, or use other approaches for calculating the distances between the terms in both tables. Of course it would also be possible to loop through both tables and perform any type of distance measure you can build with operators inside. Finally, you could of course write your own distance measure and use it within RapidMiner. There are probably hundreds of options. Have fun trying them!
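One of the options mentioned, character n-grams, can be sketched in plain Python like this. The thread does not name a specific measure, so the Jaccard distance over character trigrams is a choice made for this example only; the function names and the `#` padding character are hypothetical as well.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with '#' to mark its start and end."""
    padded = f"#{word}#"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard_distance(word_a, word_b, n=3):
    """1 minus the share of n-grams the two words have in common (0 = identical sets)."""
    grams_a = char_ngrams(word_a, n)
    grams_b = char_ngrams(word_b, n)
    return 1 - len(grams_a & grams_b) / len(grams_a | grams_b)


# Compare a few word pairs: identical, related, and unrelated.
for pair in [("able", "able"), ("competent", "competence"), ("able", "cow")]:
    print(pair, jaccard_distance(*pair))
```

Unlike the exact-match join, this gives related word forms such as "competent" and "competence" a distance between 0 and 1, so a threshold would be needed to decide what counts as a match.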

Cheers,
Ingo

andk · April 2011

Ingo, you are a hero! Thank you very much! I will try your advice and report back!

Have a nice weekend!
