"FP-Growth process fails"

hhassanien · March 2018

Hello,

The attached process failed at the FP-Growth operator with the following error:

Process Failed

Exception: java.lang.StackOverflowError

hhassanien · March 2018

Please also find the process attached herewith.

Pavithra_Rao · March 2018

Hi @hhassanien

Could you please share the data files that you used in the attached process?

Sharing the log files would also make the issue easier to debug.

The Studio logs can be found in:

C:\users\<username>\.RapidMiner\

Cheers

sgenzer · March 2018

hi @hhassanien - yes that looks like a problem. Pushing to Product Feedback.

[EDIT: @Pavithra_Rao I used "Data Mining for the Masses" pdf and got the same error. It's attached. Modified XML below.]

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="238">
        <parameter key="file" value="/Users/GenzerConsulting/Desktop/DataMiningForTheMasses.pdf"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="238">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="85"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="45" y="187"/>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="289"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="45" y="391"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="45" y="493">
            <parameter key="max_length" value="4"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="246" y="187">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="version"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="313" y="289">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="aasher"/>
            <parameter key="regular_expression" value="asher"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="447" y="289">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="document"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="187">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="hyperone"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (5)" width="90" x="514" y="85">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="page"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (7)" width="90" x="581" y="187">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="process"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="715" y="85">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="author"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
          <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
          <connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
          <connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
          <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="8.1.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="380" y="85"/>
      <operator activated="true" class="fp_growth" compatibility="8.1.001" expanded="true" height="82" name="FP-Growth" width="90" x="514" y="85"/>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Scott

yyhuang · March 2018

Hi @hhassanien,

Thanks for sharing the data and process. Do you want to use the FP-Growth algorithm to find groups of keywords that always co-occur in some documents?

There are only 5 documents here, so after text processing you get a very wide table: 5 rows and about 50k columns, i.e. 10,000 times more columns than rows. With so few transactions and so many items, FP-Growth runs into heap space issues, because all keywords that appear together in a single document will be associated in a rule with at least 20% support (1/5 = 0.2) and 100% confidence. For 50k keywords, that results in millions of association rules.
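To make the blow-up concrete, here is a small Python sketch (not part of the attached process; the function name is made up for illustration). It counts the non-empty itemsets contained in a single transaction with n items. With only five documents, every one of those itemsets already reaches 1/5 = 20% support.

```python
def itemsets_in_one_transaction(n_items: int) -> int:
    """Number of non-empty subsets of a transaction with n_items distinct items."""
    return 2 ** n_items - 1


n_documents = 5
# Support of any itemset that occurs in just one of the five documents.
min_reachable_support = 1 / n_documents

for n in (10, 20, 30):
    # The count doubles with every additional keyword in the document.
    print(n, itemsets_in_one_transaction(n))
```

A document with thousands of distinct keywords is far beyond what can be enumerated, which is why a low minimum support on this data cannot work.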

Ideally, the input data for market basket analysis (FP-Growth) has more transactions (usually more than 200 rows). Here are some workarounds for your document analysis:

1. You can add more documents to increase the number of examples, and reduce the number of columns by pruning keywords or filtering tokens. I modified the text mining process slightly by adding pruning on the corpus. The binominal data set used in FP-Growth is reduced to 5 by 400. It still created 16 million frequent item sets (keywords).

Warning: the process below may need at least 2 minutes to run FP-Growth on the reduced data set on a laptop with 32 GB of RAM. If you need to create association rules from the frequent item sets that FP-Growth produces, run it on a server with even more memory.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="free_memory" compatibility="8.1.001" expanded="true" height="68" name="Free Memory" width="90" x="45" y="34"/>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="FICO" width="90" x="179" y="34">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\FICO.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="MM" width="90" x="179" y="136">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\MM.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="SD" width="90" x="179" y="238">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\SD.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="HCM" width="90" x="179" y="340">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\HCM.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Integration" width="90" x="179" y="442">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\Integration.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" breakpoints="after" class="text:process_documents" compatibility="8.1.000" expanded="true" height="187" name="Process Documents" width="90" x="447" y="85">
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="3"/>
        <parameter key="prune_above_absolute" value="5"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
          <operator activated="false" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="179" y="136"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="715" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="version"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="849" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="aasher"/>
            <parameter key="regular_expression" value="asher"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="983" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="document"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="1117" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="hyperone"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (5)" width="90" x="1251" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="page"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (7)" width="90" x="1385" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="process"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="1519" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="author"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="1519" y="136">
            <parameter key="max_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
          <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
          <connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
          <connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
          <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">remove words that appear in every document and words that appear in only one document</description>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="8.1.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="648" y="34"/>
      <operator activated="true" breakpoints="after" class="fp_growth" compatibility="8.1.001" expanded="true" height="82" name="FP-Growth" width="90" x="782" y="34">
        <parameter key="find_min_number_of_itemsets" value="false"/>
        <parameter key="max_number_of_retries" value="10"/>
        <parameter key="min_support" value="0.9"/>
      </operator>
      <operator activated="true" class="create_association_rules" compatibility="8.1.001" expanded="true" height="82" name="Create Association Rules" width="90" x="916" y="34">
        <parameter key="min_confidence" value="1.0"/>
      </operator>
      <connect from_op="FICO" from_port="output" to_op="Process Documents" to_port="documents 5"/>
      <connect from_op="MM" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="SD" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="HCM" from_port="output" to_op="Process Documents" to_port="documents 3"/>
      <connect from_op="Integration" from_port="output" to_op="Process Documents" to_port="documents 4"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
      <connect from_op="Create Association Rules" from_port="item sets" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="147"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
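The pruning idea used in the process above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration: the toy documents are made up, the function name is hypothetical, and treating both bounds as inclusive is an assumption about how prune_below_absolute and prune_above_absolute behave.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_docs, max_docs):
    """Keep only terms that occur in at least min_docs and at most max_docs documents."""
    # Count each term once per document, no matter how often it repeats inside it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {term for term, df in doc_freq.items() if min_docs <= df <= max_docs}
    return [[term for term in doc if term in keep] for doc in docs]


# Hypothetical tokenized documents, one list of tokens per document.
docs = [
    ["invoice", "ledger", "posting"],
    ["invoice", "material", "posting"],
    ["invoice", "order", "posting"],
    ["invoice", "payroll"],
    ["invoice", "interface"],
]

# Mirrors prune_below_absolute=3 and prune_above_absolute=5 from the process.
pruned = prune_by_document_frequency(docs, min_docs=3, max_docs=5)
print(pruned)
```

Terms that occur in only one or two documents disappear, which is what shrinks the column count before FP-Growth runs.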

2. Transpose your document-term matrix to get a new data matrix with 5 columns. You can then use pairwise word-word distances to find groups of words with high similarity.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="free_memory" compatibility="8.1.001" expanded="true" height="68" name="Free Memory" width="90" x="45" y="34"/>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="FICO" width="90" x="179" y="34">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\FICO.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="MM" width="90" x="179" y="136">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\MM.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="SD" width="90" x="179" y="238">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\SD.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="HCM" width="90" x="179" y="340">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\HCM.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Integration" width="90" x="179" y="442">
        <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\Integration.pdf"/>
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" breakpoints="after" class="text:process_documents" compatibility="8.1.000" expanded="true" height="187" name="Process Documents" width="90" x="447" y="85">
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="prune_below_absolute" value="3"/>
        <parameter key="prune_above_absolute" value="5"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
          <operator activated="false" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="179" y="136"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34">
            <parameter key="min_chars" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="715" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="version"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="849" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="aasher"/>
            <parameter key="regular_expression" value="asher"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="983" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="document"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="1117" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="hyperone"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (5)" width="90" x="1251" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="page"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (7)" width="90" x="1385" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="process"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="1519" y="34">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="author"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="1653" y="85">
            <parameter key="max_length" value="4"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
          <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
          <connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
          <connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
          <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126"/>
      </operator>
      <operator activated="true" class="transpose" compatibility="8.1.001" expanded="true" height="82" name="Transpose" width="90" x="581" y="85"/>
      <operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="715" y="85"/>
      <operator activated="true" class="similarity_to_data" compatibility="8.1.001" expanded="true" height="82" name="Similarity to Data (2)" width="90" x="849" y="85"/>
      <operator activated="true" class="sort" compatibility="8.1.001" expanded="true" height="82" name="Sorted Similarity" width="90" x="983" y="85">
        <parameter key="attribute_name" value="DISTANCE"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <connect from_op="FICO" from_port="output" to_op="Process Documents" to_port="documents 5"/>
      <connect from_op="MM" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="SD" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="HCM" from_port="output" to_op="Process Documents" to_port="documents 3"/>
      <connect from_op="Integration" from_port="output" to_op="Process Documents" to_port="documents 4"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data (2)" to_port="similarity"/>
      <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data (2)" to_port="exampleSet"/>
      <connect from_op="Similarity to Data (2)" from_port="exampleSet" to_op="Sorted Similarity" to_port="example set input"/>
      <connect from_op="Sorted Similarity" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="147"/>
    </process>
  </operator>
</process>
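As a rough illustration of what the transpose-and-compare workaround above computes, here is a plain Python sketch. The toy matrix and term names are made up, and cosine similarity is my choice for the example; the measure that Data to Similarity actually applies depends on its settings.

```python
import math
from itertools import combinations


def transpose(matrix):
    """Turn a documents-by-terms matrix into a terms-by-documents matrix."""
    return [list(column) for column in zip(*matrix)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical binary document-term matrix: 5 documents (rows), 3 terms (columns).
terms = ["invoice", "posting", "payroll"]
doc_term = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]

term_doc = transpose(doc_term)  # one row per term, 5 columns (one per document)

# Score every pair of terms and list the most similar pairs first.
pairs = sorted(
    (
        (cosine(term_doc[i], term_doc[j]), terms[i], terms[j])
        for i, j in combinations(range(len(terms)), 2)
    ),
    reverse=True,
)
for score, a, b in pairs:
    print(f"{a} ~ {b}: {score:.3f}")
```

Because each word is described by only 5 numbers, comparing all word pairs stays cheap in memory per pair, although the number of pairs still grows quadratically with the vocabulary.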

3. Run word2vec (available in the Word2Vec extension from the Marketplace) on the documents to extract the vocabulary and its context with a neural network.

Please check out the knowledge base article by Dr Martin Schmitz:

https://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Synonym-Detection-with-Word2Vec/ta-p/43860

Cheers,

YY

sgenzer · March 2018

wow - thank you @Pavithra_Rao for such a detailed and helpful response!

sgenzer · March 2018

Unfortunately we're going to decline to fix this. Two reasons: 1) as @Pavithra_Rao showed, there is a good workaround for this and in fact what she shows is likely best practice anyway; 2) the FP-Growth operator is being rebuilt from the ground-up right now.

yyhuang · March 2018

We will have an improved FP-Growth operator in our next release, 8.2.

It will be much faster with the new data core implementation and will also be compatible with transactional data such as:

TransactionID item1|item2|item3|item4

Kudos to @gmeier!
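For readers unfamiliar with that layout, here is a hedged Python sketch of how such transactional lines relate to the one-column-per-item table that the older operator needed. It assumes whitespace between the transaction ID and the items and "|" between items, as in the example line above; the function names and sample data are made up, and this is not how the operator itself is implemented.

```python
def parse_transactions(lines):
    """Parse 'TransactionID item1|item2|...' lines into {id: [items]}."""
    transactions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the ID and the item list are separated by whitespace.
        transaction_id, items = line.split(maxsplit=1)
        transactions[transaction_id] = items.split("|")
    return transactions


def to_binary_table(transactions):
    """Expand transactions into one true/false column per distinct item."""
    columns = sorted({item for items in transactions.values() for item in items})
    rows = {
        tid: [column in items for column in columns]
        for tid, items in transactions.items()
    }
    return columns, rows


lines = ["T1 bread|milk", "T2 milk|eggs|bread"]
transactions = parse_transactions(lines)
columns, rows = to_binary_table(transactions)
print(columns)
print(rows)
```

The transactional form stores only the items that are present, so it avoids the very wide, mostly empty table discussed earlier in the thread.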

Fixed and Released · Last Updated March 2019
