Text Analysis on a document collection coming from a CSV
Hello!
I'm new to RapidMiner, and my main focus is to use it for text analysis of social media posts. I have a CSV file with several columns, where each row is a post/document. One of the columns is the text/body of the document. How can I select only that specific column for text analysis while, at the same time, keeping all the other columns, since they are still relevant for further analysis?
Right now I have a process like:
Read CSV -> Select Attributes (to select only the body column) -> Data to Documents -> Process Documents (Tokenize, Transform Cases, n-Grams, etc.) -> WordList to Data
This works for seeing the list of the most common words/n-grams, but now I've lost all the related data for each document. I would like to, for example, filter the documents containing a specific n-gram or word. Any tip would be helpful.
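To make the goal concrete, here is a rough pandas/scikit-learn sketch of what I'm after, outside of RapidMiner (file and column names like "posts.csv" and "body" are just placeholders):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")  # placeholder file; "body" is the text column

# roughly what Tokenize + Transform Cases + n-Grams give me: 1- and 2-gram counts
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2))
counts = vectorizer.fit_transform(posts["body"])
bag = pd.DataFrame(counts.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=posts.index)

# keep the original columns next to the term counts ...
full = pd.concat([posts, bag], axis=1)

# ... so I can still filter the posts containing a specific word or n-gram
hits = full[full["lorem ipsum"] > 0]  # assumes this 2-gram occurs in the corpus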
Thanks!
Gustavo Velho
Answers
Gustavo,
simply enable "keep text" in the Process Documents operator. That way you get an additional attribute containing the text alongside your bag of words at the upper output port of the operator.
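In pandas terms (just an analogy, not what RapidMiner does internally; the names are made up), "keep text" amounts to this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")  # placeholder names again
vec = CountVectorizer(lowercase=True)
bag = pd.DataFrame(vec.fit_transform(posts["body"]).toarray(),
                   columns=vec.get_feature_names_out())

# what "keep text" adds: the raw text travels along with each word vector
bag["text"] = posts["body"].values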
~Martin
Dortmund, Germany
Thanks Martin! That seems to make sense; I'll test it. But let me add this: what about the other data for each document? I have a file like:
AUTHOR | DATE  | CONTENT        | SOURCE
A      | 10/26 | Lorem Ipsum... | http://source.com
B      | 10/27 | Lorem Ipsum... | http://source.com
I see that RapidMiner offers several other statistics, so I would like to benefit from those after the text analysis as well.
Thanks again!
Gustavo Velho
Hi,
Process Documents should preserve the id attribute as well. That way you can simply join the resulting bag-of-words example set with the original one. Process Documents may also preserve all special roles; I would need to check this.
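As a rough pandas analogy of that join (the id and column names are invented for illustration):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")   # AUTHOR, DATE, CONTENT, SOURCE
posts["id"] = range(len(posts))    # stands in for the id attribute that is preserved

vec = CountVectorizer(lowercase=True, ngram_range=(1, 2))
bag = pd.DataFrame(vec.fit_transform(posts["CONTENT"]).toarray(),
                   columns=vec.get_feature_names_out())
bag["id"] = posts["id"].values

# join the bag-of-words example set back to the original table on the id
joined = posts.merge(bag, on="id")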
~Martin
Dortmund, Germany
Thanks Martin! That makes sense. I was starting to figure out that I would need to join the documents table with the word list, or something along those lines.
I've been using other tools for text analysis, and now I'm starting to test RapidMiner. RapidMiner seems to have a better tokenization process so far, so let's see how the rest goes.
Appreciate your help!
Gustavo