"FP-Growth fails with 4GB of memory"
RMSchwartz
Member Posts: 2 Contributor I
I can't get FP-growth to complete. I have allocated 4GB of memory to MAX_JAVA_MEMORY and that amount shows up in the system monitor within RapidMiner. I've put a small sample in the chain so that it has only about 150 cases to deal with. Nonetheless, it fails to execute to the end, exhausting 4GB of memory.
I'd welcome some assistance.
Thanks,
Bob Schwartz
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<description>Reads collections of text from a set of directories, assigning each directory to a class (as specified by parameter text_directories), and transforms them into a TF-IDF or other word vector. Finally, an SVM is applied to model the input texts.</description>
<process expanded="true" height="476" width="547">
<operator activated="true" class="retrieve" compatibility="5.1.006" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//NewLocalRepository/BPP Fishing/July25a"/>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="5.1.001" expanded="true" height="76" name="WordList to Data" width="90" x="112" y="120"/>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="3"/>
<parameter key="prune_above_absolute" value="99"/>
<list key="specify_weights"/>
<process expanded="true">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="5.1.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="45" y="345">
<parameter key="min" value="0.05"/>
<parameter key="max" value="5.0"/>
</operator>
<operator activated="true" class="sample" compatibility="5.1.006" expanded="true" height="76" name="Sample" width="90" x="179" y="345">
<parameter key="sample" value="probability"/>
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<operator activated="true" class="fp_growth" compatibility="5.1.006" expanded="true" height="76" name="FP-Growth" width="90" x="246" y="255">
<parameter key="min_number_of_itemsets" value="10"/>
<parameter key="min_support" value="0.1"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="5.1.006" expanded="true" height="76" name="Create Association Rules" width="90" x="380" y="210"/>
<connect from_op="Retrieve" from_port="output" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
<connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
<connect from_op="Create Association Rules" from_port="item sets" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
Your sampled dataset may not have many rows, but it very likely has a great many attributes, as is typical after text preprocessing. The data source was not available, but your problem is most likely the following:
In the Numerical to Binominal operator you set min=0.05 and max=5. Why? You should have set min=0 and max=0.
With your settings of min and max, when that operator executes, each document's relevant words (the attributes of the dataset) are assigned the value false in its word vector, while all the words not in the document (imagine how many!) are assigned the value true. That gives the FP-Growth operator an enormous amount of work: it has to build combinations of all the words assigned true in order to find the frequent itemsets, so 4GB, and far more than that, would not be nearly enough.
With min=0 and max=0, all the words not in a document are assigned false and all the words in it are assigned true, and you may have a chance of getting your results, assuming you also do some more preprocessing such as filtering stopwords. Stopwords occur often in each document and repeat across most if not all documents, so they too increase exponentially the number of combinations when computing the frequent itemsets.
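The effect of the two threshold settings can be sketched in plain Python. The vocabulary size of 10,000 and the word counts are made-up figures, and `to_binominal` is a hypothetical stand-in for the operator's mapping (values inside [min, max] become false, values outside become true); it is not RapidMiner's actual implementation.

```python
# Hypothetical example: one short document over a 10,000-word vocabulary.
VOCAB = 10_000
doc_counts = [3, 2, 1] + [0] * (VOCAB - 3)   # three words present, the rest absent

def to_binominal(value, lo, hi):
    # Values inside [lo, hi] map to false; values outside map to true.
    return not (lo <= value <= hi)

# min=0.05, max=5: a count of 0 falls OUTSIDE the range, so every
# absent word becomes a "true" item in the transaction.
true_items_bad = sum(to_binominal(v, 0.05, 5.0) for v in doc_counts)

# min=0, max=0: only the words actually present become "true" items.
true_items_good = sum(to_binominal(v, 0.0, 0.0) for v in doc_counts)

print(true_items_bad)    # 9997 items per transaction
print(true_items_good)   # 3 items per transaction

# FP-Growth explores subsets of the true items, so the candidate space
# per transaction is on the order of 2**k.
print(2 ** true_items_good)   # 8 candidate subsets: tractable
# 2 ** true_items_bad would be astronomically large.
```

With the wrong thresholds each transaction effectively contains almost the whole vocabulary, which is why no realistic amount of memory suffices.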
Dan
It may be that your process will finish if you disable the Create Association Rules operator; the reasons for this are set out here:
http://rapid-i.com/rapidforum/index.php/topic,3619.msg13530.html#msg13530
Just a thought, good luck!
Best,
Bob
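To see why rule creation can be the step that tips a process over, here is a small illustrative sketch (the itemset contents are made up): every frequent itemset of size k yields 2**k - 2 candidate rules, one for each split into a non-empty antecedent and consequent, so rule generation multiplies the already large itemset count.

```python
from itertools import combinations

# Hypothetical frequent itemset of size 4.
itemset = frozenset({"milk", "bread", "eggs", "butter"})

rules = []
for r in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), r):
        consequent = itemset - set(antecedent)
        rules.append((antecedent, tuple(sorted(consequent))))

print(len(rules))   # 2**4 - 2 = 14 candidate rules from one 4-itemset
```

Multiply that by thousands of frequent itemsets and the memory cost of the rule-generation stage alone becomes substantial.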
I have a dataset with 3 columns (Transaction ID, Product Description, Value) and approx. 1 million rows.
I am trying to apply FP-Growth and Create Association Rules, but this keeps failing due to memory at the "Numerical to Binominal" stage of my process. I have allocated 56GB of RAM.
"This process would need more than the maximum amount of available memory. You can either leave......"
Am I doing something wrong here? I would have thought 56GB of RAM would be more than enough to cope with this.
Any help will be much appreciated.
Thanks.
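A back-of-the-envelope sketch of why this stage runs out of memory: pivoting transactional data into a one-hot basket matrix produces one column per distinct product, and a dense matrix of that shape is huge. The miniature DataFrame and the figure of 10,000 distinct products below are assumptions for illustration, as is the 8 bytes per cell (a double); the actual column count depends on your data.

```python
import pandas as pd

# Hypothetical miniature of the dataset: Transaction ID, Product Description, Value.
df = pd.DataFrame({
    "TransactionID": [1, 1, 2, 3, 3, 3],
    "Product": ["milk", "bread", "milk", "eggs", "bread", "milk"],
    "Value": [1.5, 0.9, 1.5, 2.0, 0.9, 1.5],
})

# One-hot "basket" matrix: one row per transaction, one column per product.
basket = pd.crosstab(df["TransactionID"], df["Product"]).astype(bool)
print(basket)

# Dense-matrix memory estimate for the full dataset (assumed figures):
n_tx, n_products = 1_000_000, 10_000
gb = n_tx * n_products * 8 / 2**30   # 8 bytes per cell, if stored as doubles
print(round(gb, 1), "GB")            # ~74.5 GB, more than the 56GB allocated
```

If the estimate for your real column count exceeds your RAM, aggregating rare products, pruning, or sampling before the pivot is the usual way out.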