Text Analysis on a document collection coming from a CSV
Hello!
I'm new to RapidMiner, and my main focus is to use it for text analysis of social media posts. I have a CSV file with several columns, where each row is a post/document. One of the columns is the text/body of the document. How can I select only that specific column for text analysis while, at the same time, keeping all the other columns, since they are still relevant for further analysis?
Right now I have a process like:
Read CSV -> Select Attributes (to select only the body column) -> Data to Documents -> Process Documents (Tokenize, Transform Cases, n-Grams, etc.) -> WordList to Data
This works for seeing the list of the most common words/n-grams, but now I've lost all the related data for each document. I would like to, for example, filter the documents containing a specific n-gram or word. Any tip would be helpful.
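To make the goal concrete, here is a rough pandas/scikit-learn sketch of what I'm after, outside of RapidMiner (file and column names like "posts.csv" and "body" are just placeholders):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")  # placeholder file; "body" is the text column

# roughly what Tokenize + Transform Cases + n-Grams give me: 1- and 2-gram counts
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2))
counts = vectorizer.fit_transform(posts["body"])
bag = pd.DataFrame(counts.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=posts.index)

# keep the original columns next to the term counts ...
full = pd.concat([posts, bag], axis=1)

# ... so I can still filter the posts containing a specific word or n-gram
hits = full[full["lorem ipsum"] > 0]  # assumes this 2-gram occurs in the corpus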
Thanks!
Gustavo Velho
Answers
Gustavo,
simply enable "keep text" in the Process Documents operator. That way you get an additional attribute containing the text alongside your bag of words at the upper output port of the operator.
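In pandas terms (just an analogy, not what RapidMiner does internally; the names are made up), "keep text" amounts to this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")  # placeholder names again
vec = CountVectorizer(lowercase=True)
bag = pd.DataFrame(vec.fit_transform(posts["body"]).toarray(),
                   columns=vec.get_feature_names_out())

# what "keep text" adds: the raw text travels along with each word vector
bag["text"] = posts["body"].values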
~Martin
Dortmund, Germany
Thanks Martin! That seems to make sense; I'll test it. But let me add this: what about the other data for each document? I have a file like:
AUTHOR | DATE  | CONTENT        | SOURCE
A      | 10/26 | Lorem Ipsum... | http://source.com
B      | 10/27 | Lorem Ipsum... | http://source.com
I see that RapidMiner offers several other statistics, so I would like to benefit from those after the text analysis as well.
Thanks again!
Gustavo Velho
Hi,
Process Documents should preserve the id attribute as well. That way you can simply join the resulting bag-of-words example set with the original one. Process Documents may also preserve all special roles; I would need to check this.
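As a rough pandas analogy of that join (the id and column names are invented for illustration):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")   # AUTHOR, DATE, CONTENT, SOURCE
posts["id"] = range(len(posts))    # stands in for the id attribute that is preserved

vec = CountVectorizer(lowercase=True, ngram_range=(1, 2))
bag = pd.DataFrame(vec.fit_transform(posts["CONTENT"]).toarray(),
                   columns=vec.get_feature_names_out())
bag["id"] = posts["id"].values

# join the bag-of-words example set back to the original table on the id
joined = posts.merge(bag, on="id")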
~Martin
Dortmund, Germany
Thanks Martin! That makes sense. I was starting to figure out that I would need to join the documents table with the word list, or something along those lines.
I've been using other tools for text analysis, and now I'm starting to test RapidMiner. RapidMiner seems to have a better tokenization process so far, so let's see how the rest goes.
Appreciate your help!
Gustavo