loop a script over a large list of examples
Hi Experts,
I have an example set with 1 attribute and 1975 examples; each example is the content of a web page.
The input looks like:
For each example I'd like to execute an R script that splits the text into words, creates a bigram graph, and stores the result in a list for later analysis.
I thought I could use the Loop Values operator to run the scripts on each example, but the operator passes all 1975 examples to the scripts in each of its 1975 iterations.
If I use the Loop Examples operator instead, it also runs over all examples, and in this case the process terminates at the beginning of the second iteration with the error message: PM INFO: [1] "Failed to execute the script."; PM INFO: [1] "Evaluation error: argument `...` should be a character vector (or an object coercible to)."
This is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve 18-01-04-list of 4650 crawled pages" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Local Repository/data/18-01-04-list of 4650 crawled pages"/>
</operator>
<operator activated="true" class="generate_id" compatibility="8.0.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34">
<parameter key="create_nominal_ids" value="true"/>
</operator>
<operator activated="true" class="concurrency:loop_values" compatibility="8.0.001" expanded="true" height="124" name="Loop Values" width="90" x="313" y="34">
<parameter key="attribute" value="text"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Split Text in Words" width="90" x="45" y="34">
<parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;rm_main = function(data) {&#10;  if (is.data.frame(data)) {&#10;    spon_words &lt;- data %&gt;%&#10;      unnest_tokens(bigram, text, token = &quot;ngrams&quot;, n = 2)&#10;  }&#10;  print(spon_words)&#10;  return(list(spon_words))&#10;}&#10;"/>
</operator>
<operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory" width="90" x="45" y="136"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Seperat" width="90" x="45" y="238">
<parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;library(tidyr)&#10;library(tokenizers)&#10;rm_main = function(data) {&#10;  devided_bigrams &lt;- data %&gt;%&#10;    separate(bigram, c(&quot;word1&quot;, &quot;word2&quot;), sep = &quot; &quot;)&#10;  print(devided_bigrams)&#10;  return(list(devided_bigrams))&#10;}&#10;"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="34"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count all Bigrams" width="90" x="179" y="187">
<parameter key="script" value="rm_main = function(data) {&#10;  library(dplyr)&#10;  library(tidytext)&#10;  library(tidyr)&#10;  count_bigrams &lt;- data %&gt;%&#10;    count(word1, word2, sort = TRUE)&#10;  print(count_bigrams)&#10;  counted_bigrams &lt;- data.frame(count_bigrams)&#10;  return(counted_bigrams)&#10;}&#10;"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="85"/>
<operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (2)" width="90" x="313" y="187"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="draw graph" width="90" x="447" y="187">
<parameter key="script" value="rm_main = function(data) {&#10;  library(dplyr)&#10;  library(tidytext)&#10;  library(tidyr)&#10;  library(igraph)&#10;  bigram_graph &lt;- data %&gt;%&#10;    filter(n &gt;= 10) %&gt;%&#10;    graph_from_data_frame&#10;  print(bigram_graph)&#10;  # bigram_graph &lt;- data.frame(bigram_graph)&#10;  library(ggraph)&#10;  set.seed(2017)&#10;  graph &lt;- ggraph(bigram_graph, layout = &quot;fr&quot;) +&#10;    geom_edge_link() +&#10;    geom_node_point() +&#10;    geom_node_text(aes(label = name), vjust = 1, hjust = 1)&#10;  setwd(&quot;/home/knecht&quot;)&#10;  #graph.write(graph, &quot;/home/knecht/graph01.txt&quot;,, &quot;edgelist&quot;)&#10;  #ggsave(filename = &quot;foo300.png&quot;, width = 5, height = 4, dpi = 300, units = &quot;in&quot;, device='png')&#10;  return(list(graph))&#10;}&#10;"/>
</operator>
<operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (3)" width="90" x="581" y="187"/>
<connect from_port="input 1" to_op="Split Text in Words" to_port="input 1"/>
<connect from_op="Split Text in Words" from_port="output 1" to_op="Free Memory" to_port="through 1"/>
<connect from_op="Free Memory" from_port="through 1" to_op="Seperat" to_port="input 1"/>
<connect from_op="Seperat" from_port="output 1" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_port="output 1"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Count all Bigrams" to_port="input 1"/>
<connect from_op="Count all Bigrams" from_port="output 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="output 2"/>
<connect from_op="Multiply" from_port="output 2" to_op="Free Memory (2)" to_port="through 1"/>
<connect from_op="Free Memory (2)" from_port="through 1" to_op="draw graph" to_port="input 1"/>
<connect from_op="draw graph" from_port="output 1" to_op="Free Memory (3)" to_port="through 1"/>
<connect from_op="Free Memory (3)" from_port="through 1" to_port="output 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>
</operator>
<operator activated="false" class="loop_examples" compatibility="8.0.001" expanded="true" height="124" name="Loop Examples" width="90" x="313" y="187">
<process expanded="true">
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Split Text in Words (2)" width="90" x="45" y="34">
<parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;rm_main = function(data) {&#10;  if (is.data.frame(data)) {&#10;    spon_words &lt;- data %&gt;%&#10;      unnest_tokens(bigram, text, token = &quot;ngrams&quot;, n = 2)&#10;  }&#10;  print(spon_words)&#10;  return(list(spon_words))&#10;}&#10;"/>
</operator>
<operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (4)" width="90" x="45" y="136"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Seperat (2)" width="90" x="45" y="238">
<parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;library(tidyr)&#10;library(tokenizers)&#10;rm_main = function(data) {&#10;  devided_bigrams &lt;- data %&gt;%&#10;    separate(bigram, c(&quot;word1&quot;, &quot;word2&quot;), sep = &quot; &quot;)&#10;  print(devided_bigrams)&#10;  return(list(devided_bigrams))&#10;}&#10;"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (3)" width="90" x="179" y="34"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count all Bigrams (2)" width="90" x="179" y="187">
<parameter key="script" value="rm_main = function(data) {&#10;  library(dplyr)&#10;  library(tidytext)&#10;  library(tidyr)&#10;  count_bigrams &lt;- data %&gt;%&#10;    count(word1, word2, sort = TRUE)&#10;  print(count_bigrams)&#10;  counted_bigrams &lt;- data.frame(count_bigrams)&#10;  return(counted_bigrams)&#10;}&#10;"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (4)" width="90" x="313" y="85"/>
<operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (5)" width="90" x="313" y="187"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="draw graph (2)" width="90" x="447" y="187">
<parameter key="script" value="rm_main = function(data) {&#10;  library(dplyr)&#10;  library(tidytext)&#10;  library(tidyr)&#10;  library(igraph)&#10;  bigram_graph &lt;- data %&gt;%&#10;    filter(n &gt;= 10) %&gt;%&#10;    graph_from_data_frame&#10;  print(bigram_graph)&#10;  # bigram_graph &lt;- data.frame(bigram_graph)&#10;  library(ggraph)&#10;  set.seed(2017)&#10;  graph &lt;- ggraph(bigram_graph, layout = &quot;fr&quot;) +&#10;    geom_edge_link() +&#10;    geom_node_point() +&#10;    geom_node_text(aes(label = name), vjust = 1, hjust = 1)&#10;  setwd(&quot;/home/knecht&quot;)&#10;  #graph.write(graph, &quot;/home/knecht/graph01.txt&quot;,, &quot;edgelist&quot;)&#10;  #ggsave(filename = &quot;foo300.png&quot;, width = 5, height = 4, dpi = 300, units = &quot;in&quot;, device='png')&#10;  return(list(graph))&#10;}&#10;"/>
</operator>
<operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (6)" width="90" x="581" y="187"/>
<connect from_port="example set" to_op="Split Text in Words (2)" to_port="input 1"/>
<connect from_op="Split Text in Words (2)" from_port="output 1" to_op="Free Memory (4)" to_port="through 1"/>
<connect from_op="Free Memory (4)" from_port="through 1" to_op="Seperat (2)" to_port="input 1"/>
<connect from_op="Seperat (2)" from_port="output 1" to_op="Multiply (3)" to_port="input"/>
<connect from_op="Multiply (3)" from_port="output 1" to_port="example set"/>
<connect from_op="Multiply (3)" from_port="output 2" to_op="Count all Bigrams (2)" to_port="input 1"/>
<connect from_op="Count all Bigrams (2)" from_port="output 1" to_op="Multiply (4)" to_port="input"/>
<connect from_op="Multiply (4)" from_port="output 1" to_port="output 1"/>
<connect from_op="Multiply (4)" from_port="output 2" to_op="Free Memory (5)" to_port="through 1"/>
<connect from_op="Free Memory (5)" from_port="through 1" to_op="draw graph (2)" to_port="input 1"/>
<connect from_op="draw graph (2)" from_port="output 1" to_op="Free Memory (6)" to_port="through 1"/>
<connect from_op="Free Memory (6)" from_port="through 1" to_port="output 2"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve 18-01-04-list of 4650 crawled pages" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
<connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
<connect from_op="Loop Values" from_port="output 2" to_port="result 2"/>
<connect from_op="Loop Values" from_port="output 3" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Does anyone have an idea how I can solve this?
regards
Tobias
Best Answer
MartinLiebig, RM Data Scientist
Hi,
You can use:
Normal Loop + Filter Example Range on the iteration macro
or
Group into Collection + Loop Collection
Both should work fine for this.
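Both options boil down to handing the script one example at a time and collecting the per-example results. A minimal sketch of that idea in plain R, outside RapidMiner, might look like the following. The data frame `pages`, its column `text`, and the helper `count_bigrams` are hypothetical stand-ins, and base R is used here in place of the tidytext calls from the question.

```r
# Hypothetical stand-in for the example set: one row per crawled page.
pages <- data.frame(
  id   = c("id_1", "id_2"),
  text = c("the quick brown fox", "the quick red fox jumps"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the bigrams of a single text with base R.
count_bigrams <- function(text) {
  # Split on anything that is not a letter or digit and drop empty tokens.
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) {
    # Fewer than two words means there is no bigram to report.
    return(data.frame(word1 = character(0), word2 = character(0),
                      n = integer(0), stringsAsFactors = FALSE))
  }
  # Pair every word with its successor.
  pairs <- data.frame(word1 = head(words, -1), word2 = tail(words, -1),
                      stringsAsFactors = FALSE)
  # Sum a constant 1 per pair to get the frequency of each distinct bigram.
  counts <- aggregate(n ~ word1 + word2, data = cbind(pairs, n = 1L), FUN = sum)
  counts[order(-counts$n), ]
}

# One iteration per row, so each call sees exactly one example.
results <- lapply(seq_len(nrow(pages)), function(i) count_bigrams(pages$text[i]))
names(results) <- pages$id
```

`results` is then a named list with one bigram table per page, which is roughly what a loop that passes a single example per iteration would collect.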
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
hello @TobiasNehrig - hmm I would much prefer to use the text processing extension and the operator "Process Documents from Data". Something like this with tokenize, generate n-gram, and so forth:
Does that help?
Scott
Hi @sgenzer
thanks for your answer. The input data was previously crawled and pre-processed with Extract Content, Tokenize Sentences, Filter Stopwords, and Filter by Length. With the N-Gram operator, however, I didn't get any further in creating the graphs, so I'm trying to implement the tidy text approach (www.tidytextmining.com) via the Execute R operator. With this approach I also want to find co-occurrences, once I manage to run the script over each of the 1975 examples.
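The graph step of that approach keeps only the frequent bigrams and treats them as edges between words. As a rough sketch of this step without any packages, counted bigrams can be reduced to an edge list and a node list. The sample counts and the name `min_n` below are made up for illustration; the threshold of 10 mirrors the `n >= 10` filter in the scripts above.

```r
# Made-up bigram counts, shaped like the output of a count(word1, word2) step.
bigram_counts <- data.frame(
  word1 = c("data", "data", "text", "big"),
  word2 = c("science", "mining", "mining", "data"),
  n     = c(12L, 10L, 3L, 15L),
  stringsAsFactors = FALSE
)

min_n <- 10L  # illustrative threshold

# Edges: bigrams that occur at least min_n times.
edges <- bigram_counts[bigram_counts$n >= min_n, ]

# Nodes: every word that appears at either end of a kept edge.
nodes <- sort(unique(c(edges$word1, edges$word2)))

# Degree per node, i.e. in how many kept bigrams a word takes part.
degree <- table(factor(c(edges$word1, edges$word2), levels = nodes))
```

A data frame shaped like `edges` (two name columns followed by attributes) is the kind of input `igraph::graph_from_data_frame` takes, so the same filtering could feed the plotting script per example.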
Regards
Tobias
oh I see. OK. So you want a bigram graph for each example or for the full data set?
Can you attach the data set as shown in that screenshot?
Scott
Hi, yes, I'd like to create a bigram graph for each example.
I don't know how to attach the data set, but this is the process to generate the data set:
Hi @mschmitz,
thank you very much for your advice. Your hint with the normal loop doesn't work for me because my computer lacks the RAM for it, but the collection loop works perfectly and finishes in a very short time.
regards
Tobias
Hi @TobiasNehrig,
this most likely happens because the usual Loop runs in parallel, which takes more memory. If you deactivate parallel execution, it should also work with the usual Loop.
Best,
Martin
so nice little trick in RapidMiner - right-click on the results tab you want to save and then choose "Store ExampleSet in Repository":
Then if you want to send it to someone, the easiest is just to locate it on your drive and use the "Choose Files" button here in the community post section to attach:
OK now to your bigram graphs...stay tuned.
Scott
ok so here's the story - you can do this the "hacker" way, or the "right" way. The "hacker" way is to use the very old, almost-deprecated Reporting Extension that will create a PDF with your graphs. I'm attaching a process to this post so you can see how I did this.
The reason that this is the "hacker" way is that RM Studio is not really designed to do BI stuff. It's a data science platform - we leave BI to others like Qlik, Tableau, and so forth OR we push results to production in RM Server. So the "right" way to do this is to use one of those techniques.
Process and sample result PDF attached. Note you will need to add the Reporting Extension to RM Studio.
Scott
Hi @sgenzer, thank you very much for your help. Well, I chose the 'old-fashioned' way.
Tobias