Mining Twitter - Data loops
timeitself
Member Posts: 1 Learner III
Hi all.
For my PhD dissertation, I downloaded ~5K tweets in JSON format, placed them in a MongoDB database, extracted retweet graph data to be analyzed with Gephi/NodeXL, and extracted the text for a semantic analysis with RapidMiner.
The tweet texts are in a CSV file (I could export them in other formats as well), one tweet text per row, for a total of ~5K rows.
I need to analyze every tweet to get something close to a semantic value. For a first round, that could be a list of the words in each tweet after tokenization, n-gram generation and stopword filtering. After that, I will derive a semantic value from the words (using a word-based semantic distance).
I'm far from proficient in RapidMiner (my apologies!), and what I get after reading the CSV file is a single list of words for all the tweets together, not one list per tweet.
I would probably need a loop that starts at the first row, processes it, and iterates until the last row.
I couldn't find a way to use the loop operators properly ...
Your help would be highly appreciated!
Thanks
Carlo
Answers
I suppose you are using the Process Documents from Data operator. Like any other Process Documents operator, it provides two outputs: the word list, which indeed delivers global statistics, and an example set, which contains the word counts for every single document (one row per tweet), so no loop is needed. If you switch vector_creation to Term Occurrences, you get absolute counts. For classification/regression tasks etc., however, you will usually use TF-IDF.
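To make the difference between the two outputs concrete, here is a minimal sketch in plain Python, outside RapidMiner. It is only an illustration of the idea, not what the operator does internally: the stopword list, the helper names (tokenize, with_bigrams, term_occurrences, global_counts) and the two sample tweets are all made up for this example, and bigrams are built after stopword removal, which may differ from the order of operators in your process.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "on"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def with_bigrams(tokens):
    """Return the tokens plus adjacent-pair bigrams joined by '_'."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def term_occurrences(texts):
    """One Counter per document: absolute term counts for that document only."""
    return [Counter(with_bigrams(tokenize(t))) for t in texts]

def global_counts(per_doc):
    """Corpus-wide totals, i.e. the 'all tweets together' view."""
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    return total

# Hypothetical sample data standing in for rows of the CSV.
tweets = [
    "Mining Twitter data is fun",
    "Twitter data for the dissertation",
]

per_doc = term_occurrences(tweets)   # per-tweet view (like the example set)
totals = global_counts(per_doc)      # global view (like the word list)

for i, counts in enumerate(per_doc):
    print(i, dict(counts))
print("global:", dict(totals))
```

The per_doc list corresponds to the per-tweet rows you are after, while totals is the single global list you are currently seeing.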
Best regards,
Marius