Basic Text Mining From an Excel File

monamahfouz · July 2014

Hi everyone,

I would really appreciate some help / direction on how to tackle a basic text mining task. I have a spreadsheet with one column that I am interested in, titled "Hashtags." I would like to count the occurrences of each unique hashtag and output those counts using RapidMiner.

A single row might have several hashtags in one cell. For example, row #1's value is: "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014", which means there are FOUR hashtags here, and each one should add to the count of its own hashtag. Hence, I will need to tokenize every row's value.

If the tokenization is complex, I can skip it and treat each row as one hashtag for now. My dataset is very large, so I can afford to drop the rows that have multiple hashtags in one cell just to get it working.

I tried using SelectAttributes, Tokenize and DataToDocument but I am hitting a wall.

Any help / direction is appreciated, and I hope this isn't too basic. Thanks for your help!
Mona

MariusHelf · July 2014

Hi Mona,

You don't need any Text Processing operators (in the RapidMiner sense) at all. First, let's ignore the multi-tag rows:
Load your data and add a Filter Examples operator with the attribute_value_filter condition "Hashtags != .* .*" (without the quotes). This keeps only the rows whose value does not contain a space, i.e. the rows with a single hashtag.
Then add an Aggregate operator. Group by Hashtags and add the aggregation function count for Hashtags. That's it.
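
For anyone who wants to check the logic outside RapidMiner, the filter-then-aggregate steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the sample values are made up, and it assumes the filter expression is treated as a regular expression that has to match the whole cell value.

```python
import re
from collections import Counter

# Hypothetical sample values standing in for the hashtag column.
rows = [
    "Oscars2014",
    "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014",
    "Oscars2014",
    "AmericanHustle",
]

# Filter step: drop every value that matches ".* .*",
# i.e. every value that contains at least one space.
multi_tag = re.compile(r".* .*")
single_tag_rows = [value for value in rows if not multi_tag.fullmatch(value)]

# Aggregate step: group by the value and count each group.
counts = Counter(single_tag_rows)

for tag, n in counts.most_common():
    print(tag, n)
```

The multi-tag row is discarded entirely here, so its hashtags do not contribute to any count.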

Best regards,
Marius
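
The multi-tag rows set aside above could be handled by splitting each cell on whitespace before counting. The following is a minimal Python sketch of that idea, not a RapidMiner process: the sample values and the helper name count_hashtags are hypothetical, and it assumes hashtags within a cell are separated only by whitespace.

```python
from collections import Counter

# Hypothetical sample values; the first one is the example from the question.
rows = [
    "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014",
    "Oscars2014",
    "AmericanHustle Oscars2014",
]


def count_hashtags(values):
    """Count every whitespace-separated hashtag across all cell values."""
    counts = Counter()
    for value in values:
        # str.split() with no argument splits on any run of whitespace
        # and ignores leading/trailing blanks.
        counts.update(value.split())
    return counts


counts = count_hashtags(rows)
for tag, n in counts.most_common():
    print(tag, n)
```

With this approach each hashtag in a cell adds one to its own count, so the first row contributes to four different hashtags.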
