Text analysis of single words
Hi everyone. I am struggling with a text analysis. I've gone through the whole process of transforming and tokenizing my documents. But now I need to find the words "related" to certain specific words. For example, I want to find, across all my documents, all the words that come after the words "I", "we" and "you".
I tried many different operators but I can't come up with a solution.
Thanks for your help.
Best Answers
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi Buggia!
You could try creating "term n-grams" with n = 2. This would give you all combinations of "I word", "we word" etc. Then you would filter for the terms with the prefixes you're interested in (I, we, ...) and extract the word after the separator. The n-gram operator joins the two words with an underscore, so the terms look like "I_word".
Here's an example process:
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="-1"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="9.4.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
        <parameter key="text" value="This is my silly text with some combinations of &quot;I am&quot;, &quot;I will&quot;, &quot;I won't&quot;, &quot;we had&quot;, &quot;we have&quot; and &quot;we don't have&quot;. And again &quot;I am&quot;. "/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.4.000" expanded="true" height="103" name="Process Documents" width="90" x="179" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.4.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="246" y="34">
            <parameter key="max_length" value="2"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="9.4.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="380" y="34">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="^(I_|we_).+"/>
            <parameter key="case_sensitive" value="false"/>
            <parameter key="invert condition" value="false"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Regards,
Balázs
Buggia Member Posts: 4 Learner I
Hi BalazsBarany
Thank you for your kind answer. Since I am not very familiar with coding languages, could you please explain it to me in terms of the "operators" involved in the process?
Thank you again for your help.
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi Buggia!
The first operator just creates a document with an example text. Its output goes to "Process Documents". This is a container for additional operators to be executed inside.
Tokenize splits the text into single words (tokens) at "word boundaries" such as spaces and punctuation.
Generate n-Grams (Terms) adds every pair of consecutive words as a new token, joined with an underscore (for example I_am). (There's also Generate n-Grams (Characters), which does the same for characters inside the words.)
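For readers who think in code, the two steps above can be sketched in Python. This is only an approximation of what the operators do, not their actual implementation: the function names `tokenize` and `term_ngrams` are made up for this illustration, and the ordering of the output is an assumption.

```python
import re

def tokenize(text):
    # Keep runs of letters only, roughly like the "non letters" mode:
    # anything that is not a letter acts as a word boundary.
    return re.findall(r"[^\W\d_]+", text)

def term_ngrams(tokens, max_length=2):
    # Keep the single tokens and add every run of up to max_length
    # consecutive tokens, joined with an underscore.
    result = list(tokens)
    for n in range(2, max_length + 1):
        for i in range(len(tokens) - n + 1):
            result.append("_".join(tokens[i:i + n]))
    return result

tokens = tokenize('Again "I am" and "we don\'t have".')
print(term_ngrams(tokens))
```

Note that an apostrophe counts as a boundary here, so "don't" becomes the two tokens "don" and "t".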
Filter Tokens by Content keeps the generated "tokens" (the n-grams) that match a regular expression. Here I used ^(I_|we_).+ to refer to I or we as words in the beginning of the token. These are the words you are searching for. If you want to extend the regular expression, add your term inside the parentheses with the pipe | as the separator.
And that's it. The wordlist output contains the combinations found in the text and their frequency.
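The filtering and counting can be sketched in Python as well. This is a rough illustration under assumptions, not what Studio runs internally: the helper `following_words` and the sample token list are invented for the example, and the pattern is the one from the process with `you_` added as an extra alternative.

```python
import re
from collections import Counter

# Same idea as the regular expression in the process, extended with "you".
# IGNORECASE mirrors case_sensitive = false in the operator.
PATTERN = re.compile(r"^(I_|we_|you_).+", re.IGNORECASE)

def following_words(ngrams):
    # Count (prefix, following word) pairs for the n-grams that match.
    counts = Counter()
    for term in ngrams:
        if PATTERN.match(term):
            prefix, word = term.split("_", 1)
            counts[(prefix, word)] += 1
    return counts

# Hypothetical tokens as they might look after the n-gram step.
ngrams = ["I", "am", "I_am", "am_we", "we", "have", "we_have",
          "I_am", "you_are", "is_my"]

for (prefix, word), n in following_words(ngrams).most_common():
    print(prefix, word, n)
```

Single words such as "I" and pairs such as "is_my" do not match, because the pattern requires the whole first word to be one of the listed terms, followed by an underscore and at least one more character.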
BTW, every operator has extensive documentation in the Help tab in Studio.
Regards,
Balázs
Answers
The default way to create attributes in a text mining context is TF-IDF: Term Frequency, Inverse Document Frequency.
Term Frequency: How often a word (token) occurs in a document.
Inverse Document Frequency: A weight based on how many documents contain the word (token). The more documents contain it, the lower the weight.
You can select another method in the "vector creation" parameter of "Process Documents". For example, Term Occurrences just gives you the number of times the word occurs.
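To make the difference concrete, here is a small Python sketch. It uses the common textbook formula (relative term frequency times log(N / document frequency)), which is an assumption: the exact weighting and normalization in "Process Documents" may differ. The function names and the two sample documents are made up for the illustration.

```python
import math
from collections import Counter

def term_occurrences(tokens):
    # Plain counts: how many times each token occurs in one document.
    return Counter(tokens)

def tf_idf(docs):
    # docs is a list of token lists, one per document.
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # each document counts once per token
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return weights

docs = [["we", "have", "time"], ["we", "are", "here", "here"]]
print(term_occurrences(docs[1]))
print(tf_idf(docs))
```

In this sketch "we" appears in both documents, so its TF-IDF weight is zero even though it occurs, which is why the absolute counts are the easier thing to read when you just want frequencies.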
The Word list output always contains the absolute numbers, which is why I recommended using it. There's an operator "WordList to Data" that converts the special table to a normal one, for example for further processing or for writing the contents to a database.
Regards,
Balázs