Select a range of a Wordlist and sort it
Hello everybody,
I'm completely new to rapidminer, so please be patient.
I have the following problem:
I'm analyzing tweets, in this case the total occurrence of the words used. As a final step I want to select the top 10 words and sort them in descending order.
I already tried the approach described here:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/RapidMiner-range-by-occurence/m-p/198824
But it doesn't work and I get the following error:
Potential problem detected
The parameter first_example indexes an example, but the value 1 exceeds the example set size
I hope I'm not asking a stupid question and thanks for your help!
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:get_twitter_user_statuses" compatibility="7.3.000" expanded="true" height="68" name="Get Twitter User Statuses" width="90" x="45" y="34">
<parameter key="connection" value="TwitterConnection"/>
<parameter key="user" value="dieLinke"/>
<parameter key="limit" value="1000"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Created-At|Text"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="Created-At.le.09/24/2017 11:59:59 PM"/>
<parameter key="filters_entry_key" value="Created-At.ge.03/24/2017 00:00:01 AM"/>
<parameter key="filters_entry_key" value="Text.does_not_contain.RT"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Text|Id"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="581" y="136"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="34">
<parameter key="prune_below_absolute" value="20"/>
<parameter key="prune_above_absolute" value="1"/>
<parameter key="prune_below_rank" value="1.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="45" y="136">
<parameter key="string" value="@/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="45" y="238">
<parameter key="string" value="https"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="45" y="340">
<parameter key="string" value="//t.co/"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="45" y="442">
<parameter key="string" value="&amp"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="45" y="544"/>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="136"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="289"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="136">
<parameter key="min_chars" value="3"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_pos" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="715" y="34">
<parameter key="language" value="German"/>
<parameter key="expression" value="A.*|N.*|V.*"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
<connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
<connect from_op="Filter Tokens (4)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
<connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="45" y="391"/>
<operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="391">
<parameter key="create_nominal_ids" value="true"/>
<parameter key="offset" value="1"/>
</operator>
<operator activated="true" class="sort" compatibility="7.6.001" expanded="true" height="82" name="Sort" width="90" x="313" y="391">
<parameter key="attribute_name" value="id"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="447" y="391">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="10"/>
</operator>
<connect from_op="Get Twitter User Statuses" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Best Answer
Telcontar120
Strange, I am not sure what happened the first time.
Here's a version of the process that I think does exactly what you want. Try it and let me know. You just have to type the attribute name "total" into the Sort operator's parameter. I also enabled "keep text" on the Process Documents operator so you can view the word vector if you want.
Note that you could also make this a bit more efficient by moving part or all of the date filtering directly into the Twitter operator, using its "since id" and "max id" parameters, so you avoid retrieving tweets you don't want in the first place (I didn't do that here).
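Just to illustrate the idea, a rough sketch is below. I'm writing the parameter keys from memory and the tweet IDs are only placeholders, so treat it as a hint rather than something to copy as-is:
<!-- Hypothetical sketch: restrict the Twitter pull itself instead of filtering the dates afterwards.
     The keys "since_id"/"max_id" and the ID values are assumptions/placeholders. -->
<operator activated="true" class="social_media:get_twitter_user_statuses" compatibility="7.3.000" expanded="true" name="Get Twitter User Statuses">
<parameter key="connection" value="Twitter"/>
<parameter key="user" value="dieLinke"/>
<parameter key="limit" value="1000"/>
<parameter key="since_id" value="845000000000000000"/>
<parameter key="max_id" value="911000000000000000"/>
</operator>
Anyway, here is the full process: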
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:get_twitter_user_statuses" compatibility="7.3.000" expanded="true" height="68" name="Get Twitter User Statuses" width="90" x="45" y="34">
<parameter key="connection" value="Twitter"/>
<parameter key="user" value="dieLinke"/>
<parameter key="limit" value="1000"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="112" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Created-At|Text"/>
</operator>
<operator activated="true" breakpoints="after" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="Created-At.le.09/24/2017 11:59:59 PM"/>
<parameter key="filters_entry_key" value="Created-At.ge.03/24/2017 00:00:01 AM"/>
<parameter key="filters_entry_key" value="Text.does_not_contain.RT"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="380" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Text|Id"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="514" y="136"/>
<operator activated="true" breakpoints="after" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="648" y="136">
<parameter key="keep_text" value="true"/>
<parameter key="prune_below_absolute" value="20"/>
<parameter key="prune_above_absolute" value="1"/>
<parameter key="prune_below_rank" value="1.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="45" y="136">
<parameter key="string" value="@/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="45" y="238">
<parameter key="string" value="https"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="45" y="340">
<parameter key="string" value="//t.co/"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="45" y="442">
<parameter key="string" value="&amp"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="45" y="544"/>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="136"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="289"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="136">
<parameter key="min_chars" value="3"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_pos" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="715" y="34">
<parameter key="language" value="German"/>
<parameter key="expression" value="A.*|N.*|V.*"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
<connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
<connect from_op="Filter Tokens (4)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
<connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" breakpoints="after" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="289"/>
<operator activated="true" breakpoints="after" class="sort" compatibility="7.6.001" expanded="true" height="82" name="Sort" width="90" x="715" y="289">
<parameter key="attribute_name" value="total"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="849" y="187">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="10"/>
</operator>
<connect from_op="Get Twitter User Statuses" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="result 2"/>
<connect from_op="Filter Example Range" from_port="original" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Answers
Thanks for your quick reply!
I already tried going this route, but I get an error on my "Filter Example Range" operator.
Potential problem detected
The parameter first_example indexes an example, but the value 1 exceeds the example set size.
But I don't know what I'm doing wrong, since the word list contains more than 100 different words...
Do you have any idea?
Your process is getting that error because there are no examples left in it by the time it gets to the word list!
I found two problems you need to fix.
By placing breakpoints in the process, I was able to determine that the first fault is in your first "Filter Examples" operator. You are trying to filter tweets by date range, but the range of dates you selected doesn't actually return any records, which leads to the empty-data problem later on. So you need to widen the date window of your filter, as sketched below.
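For example, a widened filter could look something like this (the boundary dates here are just placeholders; use whatever range actually covers the account's tweets):
<!-- Sketch of a widened date filter; the dates are placeholders. -->
<operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" name="Filter Examples">
<list key="filters_list">
<parameter key="filters_entry_key" value="Created-At.ge.01/01/2017 00:00:01 AM"/>
<parameter key="filters_entry_key" value="Created-At.le.12/31/2017 11:59:59 PM"/>
<parameter key="filters_entry_key" value="Text.does_not_contain.RT"/>
</list>
</operator>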
The other problem is that after turning the word list into data, you generate an id and then sort by it. But I think you want to sort by "total" (which represents the word frequency), since sorting by id does nothing useful to reorder the data set.
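Concretely, the tail end of the process then just needs WordList to Data feeding a Sort on "total" (descending) and a Filter Example Range of 1 to 10, exactly as in the process I posted above; the Generate ID operator can simply be removed:
<operator activated="true" class="sort" compatibility="7.6.001" expanded="true" name="Sort">
<parameter key="attribute_name" value="total"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="7.6.001" expanded="true" name="Filter Example Range">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="10"/>
</operator>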
After you make those two changes the process works fine as constructed and should give you the output you want.
Thanks again for your reply! While that makes perfect sense, there are still a few things confusing me. You said that my time frame is basically too tight, but when I look at the word list output of the "Process Documents from Data" operator I do get a bunch of words.
Also, I'm not sure how to sort by "total" in this case.
Thank you so much!
You really helped me out a lot.