Relation between text and customer id

Viper1988 · July 2014

Hi,

I have a problem. I have one database table with the following columns: text, customer id and customer name.
Now I want to get a word list out of the text, so that I can see that customer 1 has written the word "RapidMiner" five times and customer 2 has written "RapidMiner" four times and "Mining" three times.

Does anybody have an idea? Sorry for my bad English :-[

Thank you very much!

Viper1988 · July 2014

My first thoughts:

I could read every single row, build a word list from the text column of that row, append each row's word list to the database, and add the customer ID column.
To write the word list to the database, I would have to convert it to data first.

How can I read every single row? I don't understand the loops.
How can I add the customer ID column?

Is that possible?

I need help!

Viper1988 · July 2014

Here is what I did.
It does not write a separate word list for each row. Instead, it writes the whole word list for all rows many times.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.015" expanded="true" height="60" name="Read Database" width="90" x="112" y="75">
        <parameter key="connection" value="twitterfb"/>
        <parameter key="query" value="SELECT `text`, `created`&#10;FROM `tweets`"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="5.3.015" expanded="true" height="76" name="Loop Examples" width="90" x="313" y="75">
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="112" y="30">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="180" y="30"/>
              <operator activated="true" class="text:filter_stopwords_german" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="315" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="450" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="585" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
              <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:wordlist_to_data" compatibility="5.3.002" expanded="true" height="76" name="WordList to Data" width="90" x="246" y="30"/>
          <operator activated="true" class="write_database" compatibility="5.3.015" expanded="true" height="60" name="Write Database" width="90" x="380" y="30">
            <parameter key="connection" value="twitterfb"/>
            <parameter key="table_name" value="test"/>
            <parameter key="overwrite_mode" value="append"/>
          </operator>
          <connect from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
          <connect from_op="WordList to Data" from_port="example set" to_op="Write Database" to_port="input"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Is there no one who could help me?

awchisholm · July 2014

Hello

You can do it without using Loop Examples.

Here's an example.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="6.0.008" expanded="true" height="76" name="get data" width="90" x="112" y="120">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd1 (2)" width="90" x="112" y="75">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner Once&quot;"/>
              <parameter key="created" value="&quot;Customer1&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd2 (2)" width="90" x="112" y="165">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner again and again with a side order of RapidAnalytics and did I mention RapidMiner?&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd3 (2)" width="90" x="112" y="255">
            <list key="attribute_values">
              <parameter key="text" value="&quot;If I hear RapidMiner one more time...&quot;"/>
              <parameter key="created" value="&quot;Customer3&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="6.0.008" expanded="true" height="112" name="Append (2)" width="90" x="313" y="120"/>
          <connect from_op="gd1 (2)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="gd2 (2)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="gd3 (2)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
          <connect from_op="Append (2)" from_port="merged set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="6.0.008" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="120">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="120">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="75"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="313" y="75"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="514" y="75"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="get data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

regards

Andrew

Viper1988 · August 2014

Thank you for your help! But I have thousands of rows with text and about 100 different customers.

Any idea?

Viper1988 · August 2014

That is what I need, but I don't know how to write the words with the value 1 to a database in relation to the text.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.015" expanded="true" height="60" name="Read Database" width="90" x="45" y="300">
        <parameter key="connection" value="twitterfb"/>
        <parameter key="query" value="SELECT `word`, `total`&#10;FROM `twitterwordlist`"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.015" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="390">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|word"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="246" y="300">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="112" y="30"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.3.015" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="75">
        <parameter key="connection" value="twitterfb"/>
        <parameter key="query" value="SELECT `text`, `created`&#10;FROM `tweets`"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="165">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens" width="90" x="45" y="30">
            <list key="replace_dictionary">
              <parameter key="Ã¶" value="ö"/>
              <parameter key="Ã¼" value="ü"/>
              <parameter key="ÃŸ" value="ß"/>
              <parameter key="Ã¤" value="ä"/>
              <parameter key="â€“" value="-"/>
              <parameter key="â€ž" value="&quot;"/>
              <parameter key="â€œ" value="&quot;"/>
              <parameter key="Ã–" value="Ö"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="180" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="315" y="30">
            <parameter key="max_chars" value="50"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="450" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="585" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="678" y="30"/>
          <connect from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Marco_Boeck · August 2014

Hi,

If you have multiple rows per customer, you can merge them and then proceed. This is a slightly adapted version of the process awchisholm posted earlier:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="6.0.009" expanded="true" height="76" name="get data" width="90" x="44" y="30">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd1 (2)" width="90" x="112" y="75">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner Once&quot;"/>
              <parameter key="created" value="&quot;Customer1&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd2 (2)" width="90" x="112" y="165">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner again and again with a side order of RapidAnalytics and did I mention RapidMiner?&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd3 (2)" width="90" x="112" y="255">
            <list key="attribute_values">
              <parameter key="text" value="&quot;If I hear RapidMiner one more time...&quot;"/>
              <parameter key="created" value="&quot;Customer3&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd2 (3)" width="90" x="112" y="345">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner is awesome!&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="6.0.009" expanded="true" height="130" name="Append (2)" width="90" x="313" y="120"/>
          <operator activated="true" class="aggregate" compatibility="6.0.009" expanded="true" height="76" name="Aggregate" width="90" x="447" y="120">
            <list key="aggregation_attributes">
              <parameter key="text" value="concatenation"/>
            </list>
            <parameter key="group_by_attributes" value="created"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.0.009" expanded="true" height="76" name="Rename" width="90" x="581" y="120">
            <parameter key="old_name" value="concat(text)"/>
            <parameter key="new_name" value="text"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_op="gd1 (2)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="gd2 (2)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="gd3 (2)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
          <connect from_op="gd2 (3)" from_port="output" to_op="Append (2)" to_port="example set 4"/>
          <connect from_op="Append (2)" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="6.0.009" expanded="true" height="76" name="Nominal to Text" width="90" x="178" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.003" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.003" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.003" expanded="true" height="60" name="Transform Cases" width="90" x="178" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.003" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="get data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,
Marco
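
For readers who want to see the merge step outside RapidMiner, here is a minimal Python sketch of what the Aggregate operator with the "concatenation" function does conceptually: all texts that share a customer id are joined into one text. The function name `merge_text_per_customer` and the space separator are assumptions made for illustration; the separator RapidMiner itself inserts may differ.

```python
from collections import defaultdict


def merge_text_per_customer(rows):
    """Join all texts that belong to the same customer id.

    `rows` is an iterable of (customer, text) pairs. The space separator
    is an assumption of this sketch, not RapidMiner's documented behaviour.
    """
    merged = defaultdict(list)
    for customer, text in rows:
        merged[customer].append(text)
    return {customer: " ".join(texts) for customer, texts in merged.items()}


# Same sample data as in the process above (Customer2 appears twice).
rows = [
    ("Customer1", "RapidMiner Once"),
    ("Customer2", "RapidMiner again and again with a side order of "
                  "RapidAnalytics and did I mention RapidMiner?"),
    ("Customer3", "If I hear RapidMiner one more time..."),
    ("Customer2", "RapidMiner is awesome!"),
]
merged = merge_text_per_customer(rows)
for customer, text in merged.items():
    print(customer, "->", text)
```

After merging there is exactly one document per customer, so the term counts produced afterwards are per customer rather than per row.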

Viper1988 · August 2014

Thank you, but how can I write this result to a database? After I execute this process with my data, I have about 20000 attributes (different words).
What is the best way to process this result so that I can write it to a database?

Is it possible to get it in a database table like this:
word        customer
bad         customer 1
rapidminer  customer 2
mining      customer 3
great       customer 2
mining      customer 2
mining      customer 2

Is that possible or does anybody have another idea?

Best regards

Marco_Boeck · August 2014

Hi,

I have modified the process a bit more so that it only returns 3 columns (customer/word/count), which is exactly what you wanted to achieve in your original post:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="6.0.009" expanded="true" height="76" name="get data" width="90" x="45" y="30">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd1 (2)" width="90" x="44" y="30">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner Once&quot;"/>
              <parameter key="created" value="&quot;Customer1&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd2 (2)" width="90" x="44" y="120">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner again and again with a side order of RapidAnalytics and did I mention RapidMiner?&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd3 (2)" width="90" x="44" y="210">
            <list key="attribute_values">
              <parameter key="text" value="&quot;If I hear RapidMiner one more time...&quot;"/>
              <parameter key="created" value="&quot;Customer3&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.009" expanded="true" height="60" name="gd2 (3)" width="90" x="44" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner is awesome!&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="6.0.009" expanded="true" height="130" name="Append (2)" width="90" x="178" y="120"/>
          <operator activated="true" class="aggregate" compatibility="6.0.009" expanded="true" height="76" name="Aggregate" width="90" x="312" y="165">
            <list key="aggregation_attributes">
              <parameter key="text" value="concatenation"/>
            </list>
            <parameter key="group_by_attributes" value="created"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.0.009" expanded="true" height="76" name="Rename" width="90" x="447" y="165">
            <parameter key="old_name" value="concat(text)"/>
            <parameter key="new_name" value="text"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_op="gd1 (2)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="gd2 (2)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="gd3 (2)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
          <connect from_op="gd2 (3)" from_port="output" to_op="Append (2)" to_port="example set 4"/>
          <connect from_op="Append (2)" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="6.0.009" expanded="true" height="76" name="Nominal to Text" width="90" x="178" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.003" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.003" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.003" expanded="true" height="60" name="Transform Cases" width="90" x="178" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.003" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="de_pivot" compatibility="6.0.009" expanded="true" height="76" name="De-Pivot" width="90" x="447" y="30">
        <list key="attribute_name">
          <parameter key="count" value="^(?!created).*$"/>
        </list>
        <parameter key="index_attribute" value="word"/>
        <parameter key="create_nominal_index" value="true"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="6.0.009" expanded="true" height="94" name="Filter Examples" width="90" x="581" y="30">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="count.gt.0"/>
        </list>
      </operator>
      <operator activated="true" class="sort" compatibility="6.0.009" expanded="true" height="76" name="Sort" width="90" x="715" y="30">
        <parameter key="attribute_name" value="count"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <connect from_op="get data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="De-Pivot" to_port="example set input"/>
      <connect from_op="De-Pivot" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,
Marco

Viper1988 · August 2014

Thank you very much, Marco! That is exactly what I'm looking for.

What is the function of the attribute "^(?!created).*$" in the De-Pivot operator?
Is it possible to drop the words with a count of 0 inside the De-Pivot operator? I am getting a memory error.
I also tested the Stream Database operator. Any ideas?

Thank you so much!

awchisholm · August 2014

My turn to answer

The "^(?!created).*$" is a regular expression that selects every attribute whose name does not start with "created", so the word columns are de-pivoted and the "created" column is left alone.
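
To see what the pattern does, here is a minimal sketch in Python. RapidMiner uses Java regular expressions, but this negative lookahead behaves the same way in Python's `re` module. The attribute names are made up for illustration. Strictly speaking, the pattern rejects any name that begins with "created", not only the exact name.

```python
import re

# Same pattern as the De-Pivot "attribute_name" parameter.
pattern = re.compile(r"^(?!created).*$")

# Hypothetical attribute names, as a word vector might produce them.
attributes = ["created", "rapidminer", "mining", "created_at", "recreated"]

# Keep only the names the pattern matches.
selected = [name for name in attributes if pattern.match(name)]
print(selected)
```

Names such as "created_at" are excluded as well, while "recreated" is kept because it does not start with "created".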

You could try setting the 0 values to missing before the De-Pivot operator as in the following process.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="6.0.008" expanded="true" height="76" name="get data" width="90" x="45" y="30">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd1 (2)" width="90" x="44" y="30">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner Once&quot;"/>
              <parameter key="created" value="&quot;Customer1&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd2 (2)" width="90" x="44" y="120">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner again and again with a side order of RapidAnalytics and did I mention RapidMiner?&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd3 (2)" width="90" x="44" y="210">
            <list key="attribute_values">
              <parameter key="text" value="&quot;If I hear RapidMiner one more time...&quot;"/>
              <parameter key="created" value="&quot;Customer3&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd2 (3)" width="90" x="44" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner is awesome!&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="6.0.008" expanded="true" height="130" name="Append (2)" width="90" x="178" y="120"/>
          <operator activated="true" class="aggregate" compatibility="6.0.008" expanded="true" height="76" name="Aggregate" width="90" x="312" y="165">
            <list key="aggregation_attributes">
              <parameter key="text" value="concatenation"/>
            </list>
            <parameter key="group_by_attributes" value="created"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.0.008" expanded="true" height="76" name="Rename" width="90" x="447" y="165">
            <parameter key="old_name" value="concat(text)"/>
            <parameter key="new_name" value="text"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_op="gd1 (2)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="gd2 (2)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="gd3 (2)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
          <connect from_op="gd2 (3)" from_port="output" to_op="Append (2)" to_port="example set 4"/>
          <connect from_op="Append (2)" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="6.0.008" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="178" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="6.0.008" expanded="true" height="76" name="Declare Missing Value" width="90" x="447" y="30">
        <parameter key="numeric_value" value="0.0"/>
      </operator>
      <operator activated="true" class="de_pivot" compatibility="6.0.008" expanded="true" height="76" name="De-Pivot" width="90" x="648" y="30">
        <list key="attribute_name">
          <parameter key="count" value="^(?!created).*$"/>
        </list>
        <parameter key="index_attribute" value="word"/>
        <parameter key="create_nominal_index" value="true"/>
      </operator>
      <operator activated="true" class="sort" compatibility="6.0.008" expanded="true" height="76" name="Sort" width="90" x="916" y="30">
        <parameter key="attribute_name" value="count"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <connect from_op="get data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Declare Missing Value" to_port="example set input"/>
      <connect from_op="Declare Missing Value" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
      <connect from_op="De-Pivot" from_port="example set output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I don't have your data so I can't be sure it will help.
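
For anyone who wants to check the idea outside RapidMiner, here is a rough Python sketch of what the process is meant to do: mark the zero counts as missing, then de-pivot while skipping the missing cells. The rows and the helper names `declare_missing` and `de_pivot` are invented for illustration and are not RapidMiner APIs.

```python
# Hypothetical word vector: one row per customer, one column per word.
rows = [
    {"created": "Customer1", "rapidminer": 1, "once": 1, "awesome": 0},
    {"created": "Customer2", "rapidminer": 3, "once": 0, "awesome": 1},
]

def declare_missing(rows, value=0):
    """Replace every cell equal to `value` with None (stands in for 'missing')."""
    return [
        {key: (None if cell == value else cell) for key, cell in row.items()}
        for row in rows
    ]

def de_pivot(rows, id_column="created"):
    """Return one (id, word, count) record per non-missing word cell."""
    return [
        (row[id_column], word, count)
        for row in rows
        for word, count in row.items()
        if word != id_column and count is not None
    ]

long_rows = de_pivot(declare_missing(rows))
# Highest counts first, like the Sort operator at the end of the process.
long_rows.sort(key=lambda record: record[2], reverse=True)
for record in long_rows:
    print(record)
```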

Andrew

Viper1988 · August 2014

Thank you for your answer.
It works, but I don't think it helps me, because the "Declare Missing Value" operator runs for about 75 minutes and the test database table has only about 1,000 rows. Other tables have about one million rows.

Is it possible to change the regular expression in the De-Pivot operator so that it takes only the words with a count > 0?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="get data" width="90" x="45" y="30">
        <process expanded="true">
          <operator activated="false" class="generate_data_user_specification" compatibility="5.3.015" expanded="true" height="60" name="gd1 (2)" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner Once&quot;"/>
              <parameter key="created" value="&quot;Customer1&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="read_database" compatibility="5.3.015" expanded="true" height="60" name="Read Database" width="90" x="313" y="75">
            <parameter key="connection" value="twitterfb"/>
            <parameter key="query" value="SELECT `text`, `created`&#10;FROM `tweets`"/>
            <enumeration key="parameters"/>
          </operator>
          <operator activated="false" class="generate_data_user_specification" compatibility="5.3.015" expanded="true" height="60" name="gd2 (2)" width="90" x="45" y="210">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner again and again with a side order of RapidAnalytics and did I mention RapidMiner?&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="false" class="generate_data_user_specification" compatibility="5.3.015" expanded="true" height="60" name="gd3 (2)" width="90" x="45" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;If I hear RapidMiner one more time...&quot;"/>
              <parameter key="created" value="&quot;Customer3&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="false" class="generate_data_user_specification" compatibility="5.3.015" expanded="true" height="60" name="gd2 (3)" width="90" x="45" y="390">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner is awesome!&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="false" class="append" compatibility="5.3.015" expanded="true" height="130" name="Append (2)" width="90" x="179" y="210"/>
          <operator activated="true" class="aggregate" compatibility="5.3.015" expanded="true" height="76" name="Aggregate" width="90" x="447" y="210">
            <list key="aggregation_attributes">
              <parameter key="text" value="concatenation"/>
            </list>
            <parameter key="group_by_attributes" value="created"/>
          </operator>
          <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename" width="90" x="581" y="210">
            <parameter key="old_name" value="concat(text)"/>
            <parameter key="new_name" value="text"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_op="gd1 (2)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="Read Database" from_port="output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="gd2 (2)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="gd3 (2)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
          <connect from_op="gd2 (3)" from_port="output" to_op="Append (2)" to_port="example set 4"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.015" expanded="true" height="76" name="Nominal to Text" width="90" x="178" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="178" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="30">
            <parameter key="max_chars" value="99"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="447" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="581" y="30"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="165"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="5.3.015" expanded="true" height="76" name="Declare Missing Value" width="90" x="447" y="165">
        <parameter key="numeric_value" value="0.0"/>
      </operator>
      <operator activated="true" class="de_pivot" compatibility="5.3.015" expanded="true" height="76" name="De-Pivot" width="90" x="447" y="30">
        <list key="attribute_name">
          <parameter key="count" value="^(?!created).*$"/>
        </list>
        <parameter key="index_attribute" value="word"/>
        <parameter key="create_nominal_index" value="true"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="5.3.015" expanded="true" height="76" name="Filter Examples" width="90" x="581" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="count &gt;= 1"/>
      </operator>
      <operator activated="true" class="sort" compatibility="5.3.015" expanded="true" height="76" name="Sort" width="90" x="715" y="30">
        <parameter key="attribute_name" value="count"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <connect from_op="get data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Declare Missing Value" to_port="example set input"/>
      <connect from_op="Declare Missing Value" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
      <connect from_op="De-Pivot" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Do you have any other ideas?

Thank you very much for your help!

Best regards

awchisholm · August 2014

The "keep missings" flag in the De-Pivot operator is doing what you want; it's unexpected to see Declare Missing Value take so long.

There are ways around this, but they start to get advanced. The general approach I would take is to split the data into batches. One very simple way is to use Loop Examples, as in the attached process (it has three options: the original and two alternatives using Loop Examples; comment out the ones you don't want).
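
As a rough illustration of the row-by-row idea (a Python sketch, not RapidMiner code), each row is de-pivoted on its own, the zero counts are dropped straight away, and the small results are appended. The rows and the function name `de_pivot_per_row` are hypothetical. Whether this avoids the memory error in RapidMiner will depend on your data.

```python
def de_pivot_per_row(rows, id_column="created"):
    """Yield (id, word, count) records, handling one input row at a time."""
    for row in rows:                                 # role of Loop Examples
        for word, count in row.items():              # De-Pivot on a single row
            if word != id_column and count > 0:      # Filter Examples: count > 0
                yield row[id_column], word, count

# Hypothetical word vector: one row per customer, one column per word.
rows = [
    {"created": "Customer1", "rapidminer": 1, "once": 1, "awesome": 0},
    {"created": "Customer2", "rapidminer": 3, "once": 0, "awesome": 1},
    {"created": "Customer3", "rapidminer": 1, "once": 0, "awesome": 0},
]

# Collecting the generator plays the role of the Append operator.
records = list(de_pivot_per_row(rows))
for record in records:
    print(record)
```

Because the zero cells are discarded per row, the long table only ever contains the words a customer actually used.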

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="6.0.008" expanded="true" height="76" name="get data" width="90" x="45" y="30">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd1 (2)" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner Once&quot;"/>
              <parameter key="created" value="&quot;Customer1&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd2 (2)" width="90" x="45" y="210">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner again and again with a side order of RapidAnalytics and did I mention RapidMiner?&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd3 (2)" width="90" x="45" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;If I hear RapidMiner one more time...&quot;"/>
              <parameter key="created" value="&quot;Customer3&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.0.008" expanded="true" height="60" name="gd2 (3)" width="90" x="45" y="390">
            <list key="attribute_values">
              <parameter key="text" value="&quot;RapidMiner is awesome!&quot;"/>
              <parameter key="created" value="&quot;Customer2&quot;"/>
            </list>
            <list key="set_additional_roles">
              <parameter key="created" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="6.0.008" expanded="true" height="130" name="Append (2)" width="90" x="246" y="210"/>
          <operator activated="true" class="aggregate" compatibility="6.0.006" expanded="true" height="76" name="Aggregate" width="90" x="447" y="210">
            <list key="aggregation_attributes">
              <parameter key="text" value="concatenation"/>
            </list>
            <parameter key="group_by_attributes" value="created"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.0.008" expanded="true" height="76" name="Rename" width="90" x="581" y="210">
            <parameter key="old_name" value="concat(text)"/>
            <parameter key="new_name" value="text"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_op="gd1 (2)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="gd2 (2)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="gd3 (2)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
          <connect from_op="gd2 (3)" from_port="output" to_op="Append (2)" to_port="example set 4"/>
          <connect from_op="Append (2)" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="6.0.003" expanded="true" height="76" name="Nominal to Text" width="90" x="178" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="1"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="178" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="30">
            <parameter key="max_chars" value="99"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="447" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="581" y="30"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="165"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="6.0.008" expanded="true" height="112" name="Multiply" width="90" x="313" y="142"/>
      <operator activated="true" class="loop_examples" compatibility="6.0.008" expanded="true" height="94" name="Loop Examples (2)" width="90" x="447" y="300">
        <process expanded="true">
          <operator activated="true" class="filter_example_range" compatibility="6.0.008" expanded="true" height="76" name="Filter Example Range (2)" width="90" x="112" y="30">
            <parameter key="first_example" value="%{example}"/>
            <parameter key="last_example" value="%{example}"/>
          </operator>
          <operator activated="true" class="de_pivot" compatibility="6.0.008" expanded="true" height="76" name="De-Pivot (3)" width="90" x="246" y="165">
            <list key="attribute_name">
              <parameter key="count" value="^(?!created).*$"/>
            </list>
            <parameter key="index_attribute" value="word"/>
            <parameter key="create_nominal_index" value="true"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="6.0.008" expanded="true" height="94" name="Filter Examples" width="90" x="380" y="165">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="count.gt.0"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Filter Example Range (2)" to_port="example set input"/>
          <connect from_op="Filter Example Range (2)" from_port="example set output" to_op="De-Pivot (3)" to_port="example set input"/>
          <connect from_op="Filter Example Range (2)" from_port="original" to_port="example set"/>
          <connect from_op="De-Pivot (3)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="6.0.008" expanded="true" height="76" name="Append (3)" width="90" x="715" y="300"/>
      <operator activated="true" class="loop_examples" compatibility="6.0.008" expanded="true" height="94" name="Loop Examples" width="90" x="447" y="165">
        <process expanded="true">
          <operator activated="true" class="filter_example_range" compatibility="6.0.008" expanded="true" height="76" name="Filter Example Range" width="90" x="112" y="30">
            <parameter key="first_example" value="%{example}"/>
            <parameter key="last_example" value="%{example}"/>
          </operator>
          <operator activated="true" class="declare_missing_value" compatibility="6.0.008" expanded="true" height="76" name="Declare Missing Value" width="90" x="246" y="165">
            <parameter key="numeric_value" value="0.0"/>
          </operator>
          <operator activated="true" class="de_pivot" compatibility="6.0.008" expanded="true" height="76" name="De-Pivot" width="90" x="380" y="165">
            <list key="attribute_name">
              <parameter key="count" value="^(?!created).*$"/>
            </list>
            <parameter key="index_attribute" value="word"/>
            <parameter key="create_nominal_index" value="true"/>
          </operator>
          <connect from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Declare Missing Value" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="original" to_port="example set"/>
          <connect from_op="Declare Missing Value" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
          <connect from_op="De-Pivot" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="6.0.008" expanded="true" height="76" name="Append" width="90" x="715" y="165"/>
      <operator activated="true" class="declare_missing_value" compatibility="6.0.008" expanded="true" height="76" name="Declare Missing Value (2)" width="90" x="447" y="30">
        <parameter key="numeric_value" value="0.0"/>
      </operator>
      <operator activated="true" class="de_pivot" compatibility="6.0.008" expanded="true" height="76" name="De-Pivot (2)" width="90" x="581" y="30">
        <list key="attribute_name">
          <parameter key="count" value="^(?!created).*$"/>
        </list>
        <parameter key="index_attribute" value="word"/>
        <parameter key="create_nominal_index" value="true"/>
      </operator>
      <operator activated="true" class="sort" compatibility="6.0.008" expanded="true" height="76" name="Sort" width="90" x="715" y="30">
        <parameter key="attribute_name" value="count"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <connect from_op="get data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Declare Missing Value (2)" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Multiply" from_port="output 3" to_op="Loop Examples (2)" to_port="example set"/>
      <connect from_op="Loop Examples (2)" from_port="output 1" to_op="Append (3)" to_port="example set 1"/>
      <connect from_op="Append (3)" from_port="merged set" to_port="result 3"/>
      <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 2"/>
      <connect from_op="Declare Missing Value (2)" from_port="example set output" to_op="De-Pivot (2)" to_port="example set input"/>
      <connect from_op="De-Pivot (2)" from_port="example set output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

One of the two options doesn't rely on missing values so the use of Loop Examples might help with the original memory problem.

I don't have your data, so I have no idea whether it will help. You might also have duplicate customers that will need aggregating; I will leave that as an exercise.
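As a starting point for that aggregation, here is a rough sketch of an Aggregate operator placed after "Append", summing `count` per customer and word. It assumes the de-pivoted output has the columns `created`, `word` and `count`, with `created` standing in for the customer id as in the process above. The parameter keys, port names and layout values are written from memory and may differ in your version, so compare them with an export from your own RapidMiner.

```xml
<!-- Sketch only: parameter keys and port names are assumptions, verify against your own export. -->
<operator activated="true" class="aggregate" compatibility="6.0.008" expanded="true" height="76" name="Aggregate" width="90" x="849" y="165">
  <list key="aggregation_attributes">
    <!-- sum the word counts within each group -->
    <parameter key="count" value="sum"/>
  </list>
  <!-- "created" stands in for the customer id here; replace it with your id attribute -->
  <parameter key="group_by_attributes" value="created|word"/>
</operator>
<!-- These two connections would replace the existing Append -> "result 2" connection. -->
<connect from_op="Append" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 2"/>
```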

Andrew

Viper1988 · August 2014

But the "Keep Missings" flag is already cleared, and the words with count = 0 are still in the results?

Both loops take much longer than the original way. The original takes about 45 seconds and the loops take about 10 minutes.
Is that normal?

Is there a way to make it faster, or another way to throw out the words with count = 0 in the original approach, either before or inside the De-Pivot operator?

Thank you so much for your help!

awchisholm · August 2014

If the data contains missing values, the "De-Pivot" operator will not output them if "Keep Missings" is cleared. If the data contains no missing values, this flag has no effect. Essentially, I am using a rather neat trick to get the De-Pivot operator to throw away words where the count is zero: "Declare Missing Value" turns the zeros into missing values first. It's a shame that "Declare Missing Value" takes so long. If it were quicker, it would have been "problem solved". I can't, at the moment, think of another way to throw away words with zero count. The problem is that the choice of attributes to de-pivot depends on their names. Each attribute holds a value for every example, so whether a value should be thrown away can only be decided for a particular attribute within a particular example.

I would be very interested in an exact comparison between the 2 loops because one uses the Missing Data method and the other doesn't. The Missing Data method should be slower. Could you say how many examples and how many attributes there are in the input data and could you say whether this is a test set or the full data?

Generally looping is slower, but the gain is that less memory may be used, so the whole process may (eventually) complete. The original problem was that you were running out of memory; do you have a time constraint as well?

It may be that the way you are starting RapidMiner limits the available memory. If you search this forum you will find ways to check this. Of course, you could buy more memory.

regards

Andrew

Marco_Boeck · August 2014

Hi,

Just a quick note: the trick via "Declare Missing Value" by awchisholm should actually be the most efficient way to do this, as that operator should not be slow. All it does is iterate over all selected attributes and, for each of them, over all examples. I just did this for 100,000 numeric attributes and 1,000 examples (i.e. 100 million values) on my dev machine, and it took about 30 seconds.

If process execution becomes really slow, the usual cause is that RM Studio does not have enough memory available (click "View" -> "System Monitor" in the top menu bar to check) and therefore Java desperately tries to free some memory. If you're using RM Studio 5, all you need to do is let Studio use more memory (or execute it on a machine with more memory).

Regards,
Marco

Viper1988 · August 2014

The system monitor shows Max: 1.1 GB and Total: 174 MB. Is that not enough?

How can I let RM Studio use more memory?
I changed the rapidminer.bat in the scripts folder but nothing happened.

I only have 4 GB on the machine. Do you think it is enough?

The smallest table has 2,100 examples with about 17,000 attributes as input for the "Declare Missing Value" operator.

Best regards

Okay, I think I found it. But how much memory should RM use?
I gave RM 1,500 MB, but it uses only a little of it during the "Declare Missing Value" operator. Is it possible to make RM use all of the memory so that the "Declare Missing Value" operator finishes faster? It has been running for about 10 minutes and is still not finished :-/ .

I am slowly starting to despair.

Or is there a completely different solution to get the relation between the words and the customer ID?

Marco_Boeck · August 2014

Hi,

My test on a Core i5 with RapidMiner Studio 6 and 100,000,000 values took 30 seconds and used just above 3 GB of memory. It will use the available memory automatically if needed; for your small dataset it won't need much at all.
I'm sorry, but I'm afraid I have no idea why your Declare Missing Value operator is taking ages.

Regards,
Marco

Viper1988 · August 2014

Hmm, okay. Thank you for your help!

I hope that Andrew can help and has another idea...

awchisholm · August 2014

I've managed to reduce the time "Declare Missing Values" takes by about 2 orders of magnitude by using the operator "Materialize Data" before it.

On an example set of 652 examples and 5068 attributes the time reduced from 520 seconds to 2 seconds.

On a larger example set of 65248 examples and 5081 attributes the time with Materialize Data was 170 seconds.
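
A minimal sketch of how this could look in the process XML posted earlier, assuming "Materialize Data" is inserted between "Multiply" and "Declare Missing Value (2)". The class name `materialize_data`, the port names and the x/y layout values are assumptions modelled on how the other operators appear in the export, so compare them with an export from your own installation.

```xml
<!-- Sketch only: class name, port names and coordinates are assumptions. -->
<operator activated="true" class="materialize_data" compatibility="6.0.008" expanded="true" height="76" name="Materialize Data" width="90" x="380" y="30"/>
<!-- These two connections would replace the existing Multiply "output 1" -> Declare Missing Value (2) connection. -->
<connect from_op="Multiply" from_port="output 1" to_op="Materialize Data" to_port="example set input"/>
<connect from_op="Materialize Data" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/>
```

The same change would apply inside the loop, in front of the inner "Declare Missing Value" operator.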

regards

Andrew
