"FP-Growth process fails"
hhassanien
Member Posts: 2 Contributor I
Hello,
The attached process failed at the FP-Growth operator with the following error:
Process Failed
Exception: java.lang.StackOverflowError
Comments
Please also find the process attached herewith.
Hi @hhassanien
Could you please share the data files used in the attached process?
Sharing the log files as well will make the issue easier to debug.
The Studio logs can be found in:
C:\users\<username>\.RapidMiner\
Cheers
hi @hhassanien - yes that looks like a problem. Pushing to Product Feedback.
[EDIT: @Pavithra_Rao I used "Data Mining for the Masses" pdf and got the same error. It's attached. Modified XML below.]
Scott
Hi @hhassanien,
Thanks for sharing the data and process. Do you want to use the FP-Growth algorithm to find groups of keywords that always co-occur in some documents?
There are only 5 documents here, so after text processing you get a very wide table: 5 rows by 50k columns. Wow, that is 10,000 times more columns than rows! So few transactions with so many items will cause a heap space issue, because all keywords that appear in a single document will be associated in rules with at least 20% support (1/5 = 0.2) and 100% confidence, which results in millions of association rules for 50k keywords.
Ideally we want input data with more transactions (usually > 200 rows) for market basket analysis (FP-Growth). So here are some workarounds for your document analysis:
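To make the blow-up concrete, here is a minimal brute-force sketch in Python (not the RapidMiner implementation). The five toy documents and the helper name `frequent_itemsets` are made up for illustration. With 5 transactions and a minimum support of 0.2, every non-empty subset of the keywords in a single document is already frequent, so a document with n distinct keywords contributes 2^n - 1 itemsets.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets (tiny inputs only)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            # Support = fraction of transactions containing every item in combo.
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[combo] = count / n
    return result

# Hypothetical toy corpus: 5 "documents" with disjoint keyword sets.
docs = [
    {"a", "b", "c", "d"},
    {"e", "f"},
    {"g"},
    {"h", "i", "j"},
    {"k", "l"},
]

found = frequent_itemsets(docs, min_support=0.2)

# Each document with n keywords yields 2**n - 1 frequent itemsets.
expected = sum(2 ** len(d) - 1 for d in docs)
print(len(found), expected)

# The same formula for a single document with only 50 distinct keywords:
print(2 ** 50 - 1)
```

Even 50 keywords in one document give about 10^15 itemsets, which is why pruning the vocabulary or adding many more documents matters far more than adding memory.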
1. You can add more documents to increase the number of examples, and reduce the number of columns by pruning keywords or filtering tokens. I modified the text mining process slightly by adding pruning on the corpus. This reduces the binominal data set passed to FP-Growth to 5 by 400, which still produced 16 million frequent itemsets (keyword combinations).
Warning: the modified process may need at least 2 minutes to run FP-Growth on the reduced data set on a laptop with 32 GB of RAM. If you need to create association rules from the frequent itemsets found by FP-Growth, run it on a server with even more memory.
2. Transpose your document-term matrix to get a new data matrix with 5 columns (one row per word), then use pairwise word-word distances to find groups of highly similar words.
3. Run word2vec (available in the Word2Vec extension from the Marketplace) on the documents to extract the vocabulary and its context with a neural network.
Please check out the knowledge base article by Dr Martin Schmitz:
https://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Synonym-Detection-with-Word2Vec/ta-p/43860
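As a rough illustration of workaround 2, here is a small plain-Python sketch (outside RapidMiner) that transposes a binary document-term matrix and computes pairwise cosine similarity between words. The vocabulary and matrix values are invented for the example, and cosine similarity is just one possible choice of word-word measure.

```python
import math
from itertools import combinations

# Hypothetical binary document-term matrix: 5 documents (rows) x 4 words (columns).
vocab = ["data", "mining", "model", "cluster"]
doc_term = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# Transpose: each word becomes a row with one value per document (5 columns).
term_doc = [list(col) for col in zip(*doc_term)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Pairwise word-word similarity over all unordered pairs of words.
similarity = {
    (vocab[i], vocab[j]): cosine(term_doc[i], term_doc[j])
    for i, j in combinations(range(len(vocab)), 2)
}

# Print pairs from most to least similar.
for pair, score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Words that appear in exactly the same documents score close to 1, and words that never share a document score 0. With only 5 documents the vectors are short, so many words will tie; a clustering step on top of these similarities would then group them.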
Cheers,
YY
wow - thank you @Pavithra_Rao for such a detailed and helpful response!
Unfortunately we're going to decline to fix this. Two reasons: 1) as @Pavithra_Rao showed, there is a good workaround for this and in fact what she shows is likely best practice anyway; 2) the FP-Growth operator is being rebuilt from the ground up right now.
We will have an improved FP-Growth operator in our next release, 8.2.
It will be much faster with the new data core implementation and also compatible with transactional data like:
TransactionID item1|item2|item3|item4
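For readers who want to see how such a layout relates to the binominal (one column per item) table used earlier in this thread, here is a small Python sketch. It assumes each line holds a transaction ID, then whitespace, then pipe-separated items; that reading of the format, the sample rows, and the helper names `parse_transactions` and `to_binominal` are assumptions for illustration, not the operator's actual parser.

```python
def parse_transactions(lines):
    """Parse 'TransactionID item1|item2|...' lines into {id: [items]}."""
    transactions = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first whitespace run
        if not parts:
            continue  # skip blank lines
        tid = parts[0]
        items = parts[1].split("|") if len(parts) > 1 else []
        transactions[tid] = [item.strip() for item in items if item.strip()]
    return transactions

def to_binominal(transactions):
    """Expand transactions into a wide table: one True/False column per item."""
    all_items = sorted({item for items in transactions.values() for item in items})
    table = {
        tid: [item in items for item in all_items]
        for tid, items in transactions.items()
    }
    return all_items, table

# Hypothetical sample input in the transactional layout.
raw = [
    "T1 bread|milk|eggs",
    "T2 milk|butter",
    "",
]

parsed = parse_transactions(raw)
columns, table = to_binominal(parsed)
print(columns)
for tid, row in table.items():
    print(tid, row)
```

The transactional form stores only the items that are present, so it stays compact even when the item vocabulary is very large, whereas the binominal table grows by one column per distinct item.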
Kudos to @gmeier !