Implement pairwise_count with Execute R
Hi Experts,
I'd like to implement the widyr function pairwise_count() in an Execute R operator, as shown in https://www.tidytextmining.com/nasa.html#word-co-ocurrences-and-correlations. For this I crawl some pages and process them, but somehow it doesn't work. I get this error message:
Dec 22, 2017 3:45:59 PM INFO: [1] "Failed to execute the script."
Dec 22, 2017 3:45:59 PM INFO: [1] "replacement has 0 rows, data has 2"
This is what my process looks like:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="8.0.001" expanded="true" height="82" name="Crawler Spon" width="90" x="45" y="34">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
<parameter key="url" value="http://www.spiegel.de"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+www.spiegel.+"/>
<parameter key="follow_link_with_matching_url" value=".+spiegel.+|.+de.+"/>
</list>
<parameter key="max_crawl_depth" value="10"/>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="max_pages" value="5"/>
<parameter key="delay" value="100"/>
<parameter key="max_concurrent_connections" value="200"/>
<parameter key="max_connections_per_host" value="100"/>
<parameter key="user_agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="246" y="34">
<parameter key="link_attribute" value="Link"/>
<parameter key="page_attribute" value="link"/>
<parameter key="random_user_agent" value="true"/>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data Spon" width="90" x="179" y="34">
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="99999"/>
<parameter key="data_management" value="memory-optimized"/>
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="link" value="1.0"/>
</list>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
<parameter key="minimum_text_block_length" value="2"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize Token" width="90" x="179" y="34">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens a-zA-Z" width="90" x="313" y="34">
<parameter key="condition" value="matches"/>
<parameter key="regular_expression" value="[a-zA-Z]+"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Tokenize Token" to_port="document"/>
<connect from_op="Tokenize Token" from_port="document" to_op="Filter Tokens a-zA-Z" to_port="document"/>
<connect from_op="Filter Tokens a-zA-Z" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="text"/>
</operator>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Execute R" width="90" x="447" y="34">
<parameter key="script" value="# rm_main is a mandatory function,
# the number of arguments has to be the number of input ports (can be none)
rm_main = function(data)
{
  library(dplyr)
  library(tidytext)
  library(widyr)
  set.seed(2017)
  test <- data %>% pairwise_count(word, text, sort = TRUE)
  print(test)
  return(list(test))
}"/>
</operator>
<connect from_op="Crawler Spon" from_port="out 1" to_op="Process Documents from Data Spon" to_port="example set"/>
<connect from_op="Process Documents from Data Spon" from_port="example set" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Execute R" to_port="input 1"/>
<connect from_op="Select Attributes (2)" from_port="original" to_port="result 2"/>
<connect from_op="Execute R" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Maybe someone here can help me tackle this problem.
Regards
Tobias
Best Answer
SGolbert (RapidMiner Certified Analyst)
Hi Tobias,
If I understood correctly, you want to pass the result of pairwise_count() to RapidMiner. That is easy:
dt <- as.data.table(pairwise_count(...))
return(list(dt))
I hope that's what you are looking for, and sorry for the delayed response.
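Putting the two lines above into context, a full Execute R script might look like the sketch below. The column names `word` and `document` are assumptions for illustration; the real example set would need columns in that tidy shape.

```r
library(widyr)
library(data.table)

# Sketch of a complete Execute R script. The input example set is
# assumed to arrive as a data frame with columns 'word' and 'document'
# (both names are illustrative, not from the original process).
rm_main <- function(data) {
  counts <- pairwise_count(data, word, document, sort = TRUE)
  # Coerce the tibble so RapidMiner can convert the result
  # back into an example set at the output port.
  dt <- as.data.table(counts)
  return(list(dt))
}
```

Returning the table via `return(list(...))` is what makes it appear in the RapidMiner results view instead of only on the console.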
Answers
Hi,
I expanded my script with the unnest_tokens function; I thought it might help with the pairwise_count function:
test %>%
unnest_tokens(word, text, token="words") %>%
print(test)
test <- data.frame(test)
On the console I can see each word in its own row, but in the results tab all words for a document are in one row again.
Now the script runs without an error, but the pairwise_count function delivers no results.
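One reason the printed output differs from the result tab is that the snippet above pipes into print() without keeping the result. A minimal sketch of a version that stores the tidied tokens, using a toy data frame with assumed columns `document` and `text`:

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the example set; the real process would pass in the
# crawled documents instead (column names are assumptions).
data <- data.frame(
  document = c(1, 2),
  text = c("House and dog", "House and cat"),
  stringsAsFactors = FALSE
)

# Assign the result instead of only printing it: unnest_tokens()
# returns a new data frame, it does not modify 'data' in place.
tidy_words <- data %>%
  unnest_tokens(word, text, token = "words")
```

By default unnest_tokens() also lowercases the tokens, so a separate case transformation is not needed afterwards.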
This is what my process now looks like:
Regards
Tobias
The problem lies entirely in your R code. You are passing a table with a single column to the script, whereas you actually need the data in tidy form. You have to get the data into this shape, either in RapidMiner or in R:
Document  Word
1         house
1         dog
2         house
2         cat
3         house
3         dog
Then the script will determine that the combination (house, dog) appears two times and (house, cat) once. In your script there are also undefined variables (word, text). If you choose to work it out in R, I recommend saving the intermediate results as CSV and then trying to solve it interactively. You can also do everything in RapidMiner using n-grams.
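The tidy table above can be fed to pairwise_count() directly; a small self-contained check of the counts described above:

```r
library(widyr)

# The tidy (document, word) table from the example above.
tidy_words <- data.frame(
  document = c(1, 1, 2, 2, 3, 3),
  word = c("house", "dog", "house", "cat", "house", "dog"),
  stringsAsFactors = FALSE
)

# Count in how many documents each pair of words co-occurs.
pairs <- pairwise_count(tidy_words, word, document, sort = TRUE)
```

Note that by default pairwise_count() returns both orderings of each pair, so (house, dog) and (dog, house) each appear as a row.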
Best,
Sebastian
Hi @SGolbert,
many thanks for your hint. I solved the problem with pairwise_count over all documents. So next I'll have to find a solution so that pairwise_count runs over each single document.
But I also have the problem that I can only see the result on the console, and I would like to have the results in the RapidMiner results table. May I ask if you have any advice for this problem?
This is my code:
Kind regards
Tobias
Hi Sebastian,
@SGolbert,
thank you for your response, and yes, your understanding is right. I think I found the problem, and it does not seem to be my script: the same process works with a different web page that has significantly fewer sub-pages, sentences and words. So I checked again, and it seems that this message could be the problem:
Mar 23, 2018 10:17:13 AM INFO: Written 48.6% of 73326128 rows in 2 secs using 8 threads. anyBufferGrown=yes; maxBuffUsed=30%. Finished in 2 secs.
Mar 23, 2018 10:17:13 AM INFO: Written 79.4% of 73326128 rows in 3 secs using 8 threads. anyBufferGrown=yes; maxBuffUsed=30%. Finished in 0 secs.
Mar 23, 2018 10:17:13 AM INFO:
Mar 23, 2018 10:18:43 AM INFO: Saving results.
Mar 23, 2018 10:18:43 AM INFO: Process //Local Repository/processes/18-03-23-test-pairwise_count finished successfully after 2:08
Process that won't work:
Process that works:
Hi,
I solved the output problem by filtering my counted words with n >= 10, and all valid results are shown. But filtering the results is not an option for me.
To get my counts without filtering, I am trying to cluster by the ID of my pages and use the Loop Clusters operator. Now my problem is that I'd like to see the counting results for each ID. I tried collecting the results inside and outside of the loop, but I always get only the result of the last iteration.
Is there a way to see all results and compare them afterwards?
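One way to get per-ID pair counts without looping on the RapidMiner side is to group inside R. The sketch below assumes a tidy token table with hypothetical columns `doc_id`, `sentence_id` and `word`, where word pairs are counted within the sentences of each document; all per-document results land in one table, keyed by `doc_id`.

```r
library(dplyr)
library(widyr)

# Hypothetical tidy token table: one row per token, with the page ID
# and a sentence ID to pair words within (column names are assumptions).
tokens <- data.frame(
  doc_id      = c(1, 1, 1, 1, 2, 2),
  sentence_id = c(1, 1, 2, 2, 1, 1),
  word        = c("house", "dog", "house", "dog", "house", "cat"),
  stringsAsFactors = FALSE
)

# Run pairwise_count() separately per document; group_modify() binds
# the per-group results back together with the doc_id column attached,
# so nothing is lost between iterations.
results <- tokens %>%
  group_by(doc_id) %>%
  group_modify(~ pairwise_count(.x, word, sentence_id, sort = TRUE)) %>%
  ungroup()
```

The resulting table has columns doc_id, item1, item2 and n, so the counts for different pages can be compared directly with a filter or pivot on doc_id.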