loop a script over a large list of examples

TobiasNehrig · January 2018

Hi Experts,

I have an example set with 1 attribute and 1975 examples; each example is the content of a web page.

The input looks like:

[Screenshot: 18-01-04-liste mit 1975 Spon Texten.png]

For each example, I'd like to run an R script that splits the text into words, creates a bigram graph, and stores the result in a list for later analysis.

I thought I could use the Loop Values operator to run the scripts on each example, but the operator would loop over all 1975 examples 1975 times.

If I use the Loop Examples operator, it also runs over all examples, but the process terminates at the beginning of the second iteration with the error message: PM INFO: [1] "Failed to execute the script."; PM INFO: [1] "Evaluation error: argument `...` should be a character vector (or an object coercible to)."

This is my process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve 18-01-04-list of 4650 crawled pages" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/data/18-01-04-list of 4650 crawled pages"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="8.0.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34">
        <parameter key="create_nominal_ids" value="true"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="8.0.001" expanded="true" height="124" name="Loop Values" width="90" x="313" y="34">
        <parameter key="attribute" value="text"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Split Text in Words" width="90" x="45" y="34">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;&#10;rm_main = function(data)&#10;{&#10;  if(is.data.frame(data)){&#10;&#9;spon_words &lt;- data %&gt;%&#10;&#9;  unnest_tokens(bigram, text, token = &quot;ngrams&quot;, n = 2)&#10;&#9;  }&#10;&#9;print(spon_words)&#10;&#10;    return(list(spon_words))    &#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory" width="90" x="45" y="136"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Seperat" width="90" x="45" y="238">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;library(tidyr)&#10;library(tokenizers)&#10;&#10;rm_main = function(data)&#10;{&#10;devided_bigrams &lt;-data %&gt;%&#10;&#9;separate(bigram, c(&quot;word1&quot;, &quot;word2&quot;), sep = &quot; &quot;)&#10;&#9;print(devided_bigrams)&#10; return(list(devided_bigrams))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="34"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count all Bigrams" width="90" x="179" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;&#10;&#9;count_bigrams &lt;- data %&gt;%&#10;&#9;  count(word1, word2, sort = TRUE)&#10;&#9;print(count_bigrams)&#10;&#10;&#9;counted_bigrams &lt;- data.frame(count_bigrams)&#10;   &#10;    return(counted_bigrams)&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="85"/>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (2)" width="90" x="313" y="187"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="draw graph" width="90" x="447" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;     library(igraph)&#10;&#10;     bigram_graph &lt;- data %&gt;%&#10;       filter(n &gt;= 10) %&gt;%&#10;       graph_from_data_frame&#10;      print(bigram_graph)&#10;    &#9;# bigram_graph &lt;- data.frame(bigram_graph)&#10;&#10;    &#9;library(ggraph)&#10;    &#9;set.seed(2017)&#10;&#10;    &#9;graph &lt;- ggraph(bigram_graph, layout = &quot;fr&quot;) +&#10;    &#9;  geom_edge_link() +&#10;    &#9;  geom_node_point() +&#10;    &#9;  geom_node_text(aes(label = name), vjust = 1, hjust =1)&#10;&#10;    &#9;setwd(&quot;/home/knecht&quot;)&#10;&#9;#graph.write(graph, &quot;/home/knecht/graph01.txt&quot;,, &quot;edgelist&quot;)&#10;    &#9;#ggsave(filename = &quot;foo300.png&quot;, width = 5, height = 4, dpi = 300, units = &quot;in&quot;, device='png')&#10;    &#9;    &#9;&#10;     return(list(graph))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (3)" width="90" x="581" y="187"/>
          <connect from_port="input 1" to_op="Split Text in Words" to_port="input 1"/>
          <connect from_op="Split Text in Words" from_port="output 1" to_op="Free Memory" to_port="through 1"/>
          <connect from_op="Free Memory" from_port="through 1" to_op="Seperat" to_port="input 1"/>
          <connect from_op="Seperat" from_port="output 1" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_port="output 1"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Count all Bigrams" to_port="input 1"/>
          <connect from_op="Count all Bigrams" from_port="output 1" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_port="output 2"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Free Memory (2)" to_port="through 1"/>
          <connect from_op="Free Memory (2)" from_port="through 1" to_op="draw graph" to_port="input 1"/>
          <connect from_op="draw graph" from_port="output 1" to_op="Free Memory (3)" to_port="through 1"/>
          <connect from_op="Free Memory (3)" from_port="through 1" to_port="output 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
          <portSpacing port="sink_output 4" spacing="0"/>
        </process>
      </operator>
      <operator activated="false" class="loop_examples" compatibility="8.0.001" expanded="true" height="124" name="Loop Examples" width="90" x="313" y="187">
        <process expanded="true">
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Split Text in Words (2)" width="90" x="45" y="34">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;&#10;rm_main = function(data)&#10;{&#10;  if(is.data.frame(data)){&#10;&#9;spon_words &lt;- data %&gt;%&#10;&#9;  unnest_tokens(bigram, text, token = &quot;ngrams&quot;, n = 2)&#10;&#9;  }&#10;&#9;print(spon_words)&#10;&#10;    return(list(spon_words))    &#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (4)" width="90" x="45" y="136"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Seperat (2)" width="90" x="45" y="238">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;library(tidyr)&#10;library(tokenizers)&#10;&#10;rm_main = function(data)&#10;{&#10;devided_bigrams &lt;-data %&gt;%&#10;&#9;separate(bigram, c(&quot;word1&quot;, &quot;word2&quot;), sep = &quot; &quot;)&#10;&#9;print(devided_bigrams)&#10; return(list(devided_bigrams))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (3)" width="90" x="179" y="34"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count all Bigrams (2)" width="90" x="179" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;&#10;&#9;count_bigrams &lt;- data %&gt;%&#10;&#9;  count(word1, word2, sort = TRUE)&#10;&#9;print(count_bigrams)&#10;&#10;&#9;counted_bigrams &lt;- data.frame(count_bigrams)&#10;   &#10;    return(counted_bigrams)&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (4)" width="90" x="313" y="85"/>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (5)" width="90" x="313" y="187"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="draw graph (2)" width="90" x="447" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;     library(igraph)&#10;&#10;     bigram_graph &lt;- data %&gt;%&#10;       filter(n &gt;= 10) %&gt;%&#10;       graph_from_data_frame&#10;      print(bigram_graph)&#10;    &#9;# bigram_graph &lt;- data.frame(bigram_graph)&#10;&#10;    &#9;library(ggraph)&#10;    &#9;set.seed(2017)&#10;&#10;    &#9;graph &lt;- ggraph(bigram_graph, layout = &quot;fr&quot;) +&#10;    &#9;  geom_edge_link() +&#10;    &#9;  geom_node_point() +&#10;    &#9;  geom_node_text(aes(label = name), vjust = 1, hjust =1)&#10;&#10;    &#9;setwd(&quot;/home/knecht&quot;)&#10;&#9;#graph.write(graph, &quot;/home/knecht/graph01.txt&quot;,, &quot;edgelist&quot;)&#10;    &#9;#ggsave(filename = &quot;foo300.png&quot;, width = 5, height = 4, dpi = 300, units = &quot;in&quot;, device='png')&#10;    &#9;    &#9;&#10;     return(list(graph))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (6)" width="90" x="581" y="187"/>
          <connect from_port="example set" to_op="Split Text in Words (2)" to_port="input 1"/>
          <connect from_op="Split Text in Words (2)" from_port="output 1" to_op="Free Memory (4)" to_port="through 1"/>
          <connect from_op="Free Memory (4)" from_port="through 1" to_op="Seperat (2)" to_port="input 1"/>
          <connect from_op="Seperat (2)" from_port="output 1" to_op="Multiply (3)" to_port="input"/>
          <connect from_op="Multiply (3)" from_port="output 1" to_port="example set"/>
          <connect from_op="Multiply (3)" from_port="output 2" to_op="Count all Bigrams (2)" to_port="input 1"/>
          <connect from_op="Count all Bigrams (2)" from_port="output 1" to_op="Multiply (4)" to_port="input"/>
          <connect from_op="Multiply (4)" from_port="output 1" to_port="output 1"/>
          <connect from_op="Multiply (4)" from_port="output 2" to_op="Free Memory (5)" to_port="through 1"/>
          <connect from_op="Free Memory (5)" from_port="through 1" to_op="draw graph (2)" to_port="input 1"/>
          <connect from_op="draw graph (2)" from_port="output 1" to_op="Free Memory (6)" to_port="through 1"/>
          <connect from_op="Free Memory (6)" from_port="through 1" to_port="output 2"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve 18-01-04-list of 4650 crawled pages" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <connect from_op="Loop Values" from_port="output 2" to_port="result 2"/>
      <connect from_op="Loop Values" from_port="output 3" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Does anyone have an idea how I can solve this?

regards

Tobias

MartinLiebig · January 2018

Hi,

You can use:

A normal Loop + Filter Example Range on the iteration macro

or

Group into Collection + Loop Collection

Both should work fine for this.
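As a rough sketch of the idea behind both options, here is the pattern in plain Python (not a RapidMiner process and not the R scripts from the question). Each iteration sees exactly one example, and the per-example results are collected in a list. The sample texts and the helper name `bigrams_for_text` are made up for illustration.

```python
# Illustrative only: plain Python, not a RapidMiner process.
# `texts` stands in for the single text attribute of the example set.
texts = [
    "der hund jagt die katze",
    "die katze jagt die maus",
]


def bigrams_for_text(text):
    """Return the (word1, word2) pairs of adjacent words for one text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


# One pass over the examples: each iteration handles exactly one row,
# and the result for that row is added to the collection.
results = [bigrams_for_text(text) for text in texts]

for index, pairs in enumerate(results, start=1):
    print(index, pairs)
```

The point is that the loop body only ever receives one row, so the work grows linearly with the number of examples instead of 1975 × 1975.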

Best,

Martin

sgenzer · January 2018

hello @TobiasNehrig - hmm I would much prefer to use the text processing extension and the operator "Process Documents from Data". Something like this with tokenize, generate n-gram, and so forth:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve 18-01-04-list of 4650 crawled pages" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/data/18-01-04-list of 4650 crawled pages"/>
      </operator>
      <operator activated="false" class="generate_id" compatibility="8.0.001" expanded="true" height="82" name="Generate ID" width="90" x="45" y="187">
        <parameter key="create_nominal_ids" value="true"/>
      </operator>
      <operator activated="false" class="concurrency:loop_values" compatibility="8.0.001" expanded="true" height="124" name="Loop Values" width="90" x="179" y="187">
        <parameter key="attribute" value="text"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Split Text in Words" width="90" x="45" y="34">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;&#10;rm_main = function(data)&#10;{&#10;  if(is.data.frame(data)){&#10;&#9;spon_words &lt;- data %&gt;%&#10;&#9;  unnest_tokens(bigram, text, token = &quot;ngrams&quot;, n = 2)&#10;&#9;  }&#10;&#9;print(spon_words)&#10;&#10;    return(list(spon_words))    &#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory" width="90" x="45" y="136"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Seperat" width="90" x="45" y="238">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;library(tidyr)&#10;library(tokenizers)&#10;&#10;rm_main = function(data)&#10;{&#10;devided_bigrams &lt;-data %&gt;%&#10;&#9;separate(bigram, c(&quot;word1&quot;, &quot;word2&quot;), sep = &quot; &quot;)&#10;&#9;print(devided_bigrams)&#10; return(list(devided_bigrams))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="34"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count all Bigrams" width="90" x="179" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;&#10;&#9;count_bigrams &lt;- data %&gt;%&#10;&#9;  count(word1, word2, sort = TRUE)&#10;&#9;print(count_bigrams)&#10;&#10;&#9;counted_bigrams &lt;- data.frame(count_bigrams)&#10;   &#10;    return(counted_bigrams)&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="85"/>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (2)" width="90" x="313" y="187"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="draw graph" width="90" x="447" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;     library(igraph)&#10;&#10;     bigram_graph &lt;- data %&gt;%&#10;       filter(n &gt;= 10) %&gt;%&#10;       graph_from_data_frame&#10;      print(bigram_graph)&#10;    &#9;# bigram_graph &lt;- data.frame(bigram_graph)&#10;&#10;    &#9;library(ggraph)&#10;    &#9;set.seed(2017)&#10;&#10;    &#9;graph &lt;- ggraph(bigram_graph, layout = &quot;fr&quot;) +&#10;    &#9;  geom_edge_link() +&#10;    &#9;  geom_node_point() +&#10;    &#9;  geom_node_text(aes(label = name), vjust = 1, hjust =1)&#10;&#10;    &#9;setwd(&quot;/home/knecht&quot;)&#10;&#9;#graph.write(graph, &quot;/home/knecht/graph01.txt&quot;,, &quot;edgelist&quot;)&#10;    &#9;#ggsave(filename = &quot;foo300.png&quot;, width = 5, height = 4, dpi = 300, units = &quot;in&quot;, device='png')&#10;    &#9;    &#9;&#10;     return(list(graph))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (3)" width="90" x="581" y="187"/>
          <connect from_port="input 1" to_op="Split Text in Words" to_port="input 1"/>
          <connect from_op="Split Text in Words" from_port="output 1" to_op="Free Memory" to_port="through 1"/>
          <connect from_op="Free Memory" from_port="through 1" to_op="Seperat" to_port="input 1"/>
          <connect from_op="Seperat" from_port="output 1" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_port="output 1"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Count all Bigrams" to_port="input 1"/>
          <connect from_op="Count all Bigrams" from_port="output 1" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_port="output 2"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Free Memory (2)" to_port="through 1"/>
          <connect from_op="Free Memory (2)" from_port="through 1" to_op="draw graph" to_port="input 1"/>
          <connect from_op="draw graph" from_port="output 1" to_op="Free Memory (3)" to_port="through 1"/>
          <connect from_op="Free Memory (3)" from_port="through 1" to_port="output 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
          <portSpacing port="sink_output 4" spacing="0"/>
        </process>
      </operator>
      <operator activated="false" class="loop_examples" compatibility="8.0.001" expanded="true" height="124" name="Loop Examples" width="90" x="313" y="187">
        <process expanded="true">
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Split Text in Words (2)" width="90" x="45" y="34">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;&#10;rm_main = function(data)&#10;{&#10;  if(is.data.frame(data)){&#10;&#9;spon_words &lt;- data %&gt;%&#10;&#9;  unnest_tokens(bigram, text, token = &quot;ngrams&quot;, n = 2)&#10;&#9;  }&#10;&#9;print(spon_words)&#10;&#10;    return(list(spon_words))    &#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (4)" width="90" x="45" y="136"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Seperat (2)" width="90" x="45" y="238">
            <parameter key="script" value="library(dplyr)&#10;library(tidytext)&#10;library(tidyr)&#10;library(tokenizers)&#10;&#10;rm_main = function(data)&#10;{&#10;devided_bigrams &lt;-data %&gt;%&#10;&#9;separate(bigram, c(&quot;word1&quot;, &quot;word2&quot;), sep = &quot; &quot;)&#10;&#9;print(devided_bigrams)&#10; return(list(devided_bigrams))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (3)" width="90" x="179" y="34"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count all Bigrams (2)" width="90" x="179" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;&#10;&#9;count_bigrams &lt;- data %&gt;%&#10;&#9;  count(word1, word2, sort = TRUE)&#10;&#9;print(count_bigrams)&#10;&#10;&#9;counted_bigrams &lt;- data.frame(count_bigrams)&#10;   &#10;    return(counted_bigrams)&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (4)" width="90" x="313" y="85"/>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (5)" width="90" x="313" y="187"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="draw graph (2)" width="90" x="447" y="187">
            <parameter key="script" value="rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#9;library(tidyr)&#10;     library(igraph)&#10;&#10;     bigram_graph &lt;- data %&gt;%&#10;       filter(n &gt;= 10) %&gt;%&#10;       graph_from_data_frame&#10;      print(bigram_graph)&#10;    &#9;# bigram_graph &lt;- data.frame(bigram_graph)&#10;&#10;    &#9;library(ggraph)&#10;    &#9;set.seed(2017)&#10;&#10;    &#9;graph &lt;- ggraph(bigram_graph, layout = &quot;fr&quot;) +&#10;    &#9;  geom_edge_link() +&#10;    &#9;  geom_node_point() +&#10;    &#9;  geom_node_text(aes(label = name), vjust = 1, hjust =1)&#10;&#10;    &#9;setwd(&quot;/home/knecht&quot;)&#10;&#9;#graph.write(graph, &quot;/home/knecht/graph01.txt&quot;,, &quot;edgelist&quot;)&#10;    &#9;#ggsave(filename = &quot;foo300.png&quot;, width = 5, height = 4, dpi = 300, units = &quot;in&quot;, device='png')&#10;    &#9;    &#9;&#10;     return(list(graph))&#10;}&#10;"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.0.001" expanded="true" height="82" name="Free Memory (6)" width="90" x="581" y="187"/>
          <connect from_port="example set" to_op="Split Text in Words (2)" to_port="input 1"/>
          <connect from_op="Split Text in Words (2)" from_port="output 1" to_op="Free Memory (4)" to_port="through 1"/>
          <connect from_op="Free Memory (4)" from_port="through 1" to_op="Seperat (2)" to_port="input 1"/>
          <connect from_op="Seperat (2)" from_port="output 1" to_op="Multiply (3)" to_port="input"/>
          <connect from_op="Multiply (3)" from_port="output 1" to_port="example set"/>
          <connect from_op="Multiply (3)" from_port="output 2" to_op="Count all Bigrams (2)" to_port="input 1"/>
          <connect from_op="Count all Bigrams (2)" from_port="output 1" to_op="Multiply (4)" to_port="input"/>
          <connect from_op="Multiply (4)" from_port="output 1" to_port="output 1"/>
          <connect from_op="Multiply (4)" from_port="output 2" to_op="Free Memory (5)" to_port="through 1"/>
          <connect from_op="Free Memory (5)" from_port="through 1" to_op="draw graph (2)" to_port="input 1"/>
          <connect from_op="draw graph (2)" from_port="output 1" to_op="Free Memory (6)" to_port="through 1"/>
          <connect from_op="Free Memory (6)" from_port="through 1" to_port="output 2"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="179" y="34">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="179" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve 18-01-04-list of 4650 crawled pages" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Does that help?

Scott

TobiasNehrig · January 2018

Hi @sgenzer

Thanks for your answer. The input data was previously crawled and pre-processed with Extract Content, Tokenize (sentences), Filter Stopwords, and Filter Tokens (by Length). With the n-gram operator, however, I couldn't get any further toward creating the graphs, so I'm trying to implement the tidy text approach (www.tidytextmining.com) via the Execute R operator. With this approach I also want to find co-occurrences, once I manage to run the script over each of the 1975 examples.
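To make the difference between the two counts concrete, here is a minimal sketch in plain Python (not R or tidytext). Bigrams count adjacent word pairs. Co-occurrence is assumed here to mean that two words appear in the same sentence, regardless of order; other definitions are possible. The sample sentences and function names are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one example, already split into sentences.
sentences = [
    "die katze jagt die maus",
    "die maus frisst brot",
]


def count_bigrams(sentences):
    """Count adjacent word pairs, sentence by sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def count_cooccurrences(sentences):
    """Count unordered word pairs that share a sentence (assumed definition)."""
    counts = Counter()
    for sentence in sentences:
        # Sort the unique words so every pair has one canonical order.
        unique_words = sorted(set(sentence.split()))
        counts.update(combinations(unique_words, 2))
    return counts


bigram_counts = count_bigrams(sentences)
cooccurrence_counts = count_cooccurrences(sentences)

print(bigram_counts.most_common(3))
print(cooccurrence_counts[("jagt", "maus")])
```

A pair such as ("jagt", "maus") never appears as a bigram in this sample, but it is counted as a co-occurrence because both words are in the first sentence.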

Regards

Tobias

sgenzer · January 2018

oh I see. OK. So you want a bigram graph for each example or for the full data set?

Can you attach the data set as shown in that screenshot?

Scott

TobiasNehrig · January 2018

Hi, yes, I'd like to create a bigram graph for each example.
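Roughly, this is what I mean by one graph per example, sketched in plain Python instead of the R scripts above. The bigrams of one text are counted, pairs below a minimum count are dropped (my R script uses `n >= 10`; the toy data here uses 2), and the rest become weighted edges. The function names and sample texts are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_edges(text, min_count):
    """Return (word1, word2, n) edges for bigrams seen at least min_count times."""
    words = text.lower().split()
    counts = Counter(zip(words, words[1:]))
    return [(w1, w2, n) for (w1, w2), n in counts.items() if n >= min_count]


def to_adjacency(edges):
    """Turn an edge list into a nested dict: graph[word1][word2] = count."""
    graph = defaultdict(dict)
    for w1, w2, n in edges:
        graph[w1][w2] = n
    return dict(graph)


# Toy stand-ins for two examples of the data set.
texts = ["a b a b a c", "x y x y z"]

# One graph per example, kept in a list for later analysis.
graphs = [to_adjacency(bigram_edges(text, min_count=2)) for text in texts]

for index, graph in enumerate(graphs, start=1):
    print(index, graph)
```

Drawing each graph is a separate step; the sketch only builds the data structure that a plotting library would take as input.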

I don't know how to attach the data set, but this is the process to generate the data set:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="8.0.001" expanded="true" height="82" name="Crawler Spon" width="90" x="45" y="34">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
            <parameter key="url" value="http://www.spiegel.de"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+www.spiegel.+"/>
              <parameter key="follow_link_with_matching_url" value=".+spiegel.+|.+de.+"/>
            </list>
            <parameter key="max_crawl_depth" value="10"/>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="add_content_as_attribute" value="true"/>
            <parameter key="max_pages" value="4650"/>
            <parameter key="delay" value="100"/>
            <parameter key="max_concurrent_connections" value="200"/>
            <parameter key="max_connections_per_host" value="100"/>
            <parameter key="user_agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"/>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="246" y="34">
            <parameter key="link_attribute" value="Link"/>
            <parameter key="page_attribute" value="link"/>
            <parameter key="random_user_agent" value="true"/>
          </operator>
          <connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="8.0.001" expanded="true" height="68" name="Store" width="90" x="179" y="34">
        <parameter key="repository_entry" value="../data/18-01-04-crawler rund2000 spiegelseiten"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data Spon (2)" width="90" x="313" y="34">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_absolute" value="10"/>
        <parameter key="prune_above_absolute" value="3000"/>
        <parameter key="data_management" value="memory-optimized"/>
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="link" value="1.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
            <parameter key="minimum_text_block_length" value="2"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize Token" width="90" x="179" y="34">
            <parameter key="mode" value="linguistic sentences"/>
            <parameter key="language" value="German"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="99"/>
          </operator>
          <operator activated="false" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens a-zA-Z (2)" width="90" x="246" y="136">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="[a-zA-Z]+"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Tokenize Token" to_port="document"/>
          <connect from_op="Tokenize Token" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Crawler Spon" from_port="out 1" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_op="Process Documents from Data Spon (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data Spon (2)" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

TobiasNehrig · January 2018

Hi @mschmitz,

Thank you very much for your advice. The normal loop doesn't work for me because my computer lacks the RAM, but the collection loop works perfectly and finishes in a very short time.

regards

Tobias

MartinLiebig · January 2018

Hi @TobiasNehrig,

This most likely happens because the usual Loop runs in parallel, which takes more memory. If you deactivate parallel execution, it should also work with the usual Loop.

Best,

Martin

sgenzer · January 2018

so nice little trick in RapidMiner - right-click on the results tab you want to save and then choose "Store ExampleSet in Repository":

[Screenshot: Screen Shot 2018-01-08 at 9.14.44 AM.png]

Then if you want to send it to someone, the easiest is just to locate it on your drive and use the "Choose Files" button here in the community post section to attach:

[Screenshot: Screen Shot 2018-01-08 at 9.16.59 AM.png]

OK now to your bigram graphs...stay tuned.

Scott

sgenzer · January 2018

ok so here's the story - you can do this the "hacker" way, or the "right" way. The "hacker" way is to use the very old, almost-deprecated Reporting Extension that will create a PDF with your graphs. I'm attaching a process to this post so you can see how I did this.

The reason that this is the "hacker" way is that RM Studio is not really designed to do BI stuff. It's a data science platform - we leave BI to others like Qlik, Tableau, and so forth OR we push results to production in RM Server. So the "right" way to do this is to use one of those techniques.

Process and sample result PDF attached. Note you will need to add the Reporting Extension to RM Studio.

Scott

TobiasNehrig · January 2018

Hi @sgenzer, thank you very much for your help. Well, I chose the old-fashioned way.

Tobias
