"Java Heap space ERROR"
Hello,
I am attempting to do some basic tokenization of text files. I will then attempt to cluster them.
Right now, I am testing with only 200 small text files. RM processes for a while and then gives me an out-of-memory error. I have given 1 GB of memory to RM.
I would eventually like to use RM to cluster batches of 1,000 or even 10,000 files, but am concerned that I cannot even do the basic tokenization of only 200.
Please let me know if you have any ideas or suggestions.
Thanks!!
---------------------
Below is the XML of my process
<process version="4.2">
  <operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
      <parameter key="create_text_visualizer" value="true"/>
      <parameter key="default_content_language" value="english"/>
      <list key="namespaces">
      </list>
      <parameter key="on_the_fly_pruning" value="0"/>
      <parameter key="prune_below" value="10%"/>
      <list key="texts">
        <parameter key="News_Articles" value="/Users/noah/Desktop/test_files"/>
      </list>
      <operator name="StringTokenizer" class="StringTokenizer">
      </operator>
      <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
      </operator>
      <operator name="TokenLengthFilter" class="TokenLengthFilter">
        <parameter key="min_chars" value="3"/>
      </operator>
      <operator name="TermNGramGenerator" class="TermNGramGenerator">
      </operator>
    </operator>
  </operator>
</process>
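For anyone more comfortable reading plain code than the operator chain, here is a rough plain-Python sketch of what the process above does; the tokenizer, stopword list, and bigram step are simplified stand-ins, not the Text plugin's actual implementations.

# Simplified stand-in for TextInput -> StringTokenizer -> EnglishStopwordFilter
# -> TokenLengthFilter(min_chars=3) -> TermNGramGenerator.
import os
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}  # tiny stand-in list

def tokenize(text):
    # StringTokenizer: split on non-letter characters
    return re.findall(r"[a-z]+", text.lower())

def process_file(path, min_chars=3):
    with open(path, errors="ignore") as f:
        tokens = tokenize(f.read())
    # EnglishStopwordFilter + TokenLengthFilter
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= min_chars]
    # TermNGramGenerator: add word bigrams on top of the single terms
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

if __name__ == "__main__":
    directory = "/Users/noah/Desktop/test_files"   # same directory as in the process XML
    vocabulary = set()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            vocabulary.update(process_file(path))
    print(len(vocabulary), "distinct terms/bigrams across the corpus")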
Answers
The process runs just fine on my machine with the sample newsgroup texts of the text miner plugin.
Without the data I can't really say where the error comes from, so I have to ask a few questions.
Could you describe the memory monitor behavior before the error? Or even post a picture?
Greetings,
Sebastian
Thanks for the reply.
I have been able to easily run the sample without any problems. The sample is only a few files.
My files are all plain text with no HTML or XML. They vary in length from 4 KB to 400 KB. The total size of all 50 test files is 15.8 MB.
I assigned 1025m to RM before starting. The memory monitor shows the memory growing and shrinking over time, but mostly growing.
I noticed that some of the steps are running "492" times. This seems odd since I only have 50 files. Is that a clue?
I am also getting a few strange warnings in the log.
By the way, we frequently work on text classification of thousands of texts without any problem; for short texts it's even hundreds of thousands of texts. Of course the parameter settings have a great influence on memory usage. For example, using n-grams or not pruning enough will blow up the number of dimensions a lot for large texts with many different words.
Cheers,
Ingo
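To make the pruning/n-gram point concrete, here is a small sketch assuming scikit-learn is available; the path pattern and pruning values are placeholders, not settings from the process above.

# Compare the feature count for plain terms, terms + bigrams, and
# terms + bigrams with terms pruned that occur in fewer than 10% of the texts.
import glob
from sklearn.feature_extraction.text import CountVectorizer

docs = [open(p, errors="ignore").read() for p in glob.glob("/Users/noah/Desktop/test_files/*")]

for ngram_range, min_df in [((1, 1), 1), ((1, 2), 1), ((1, 2), 0.1)]:
    features = CountVectorizer(ngram_range=ngram_range, min_df=min_df,
                               stop_words="english").fit_transform(docs).shape[1]
    print("ngram_range", ngram_range, "min_df", min_df, "->", features, "features")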
I think you're correct about two things:
1) There was a hidden directory. I have corrected the problem and now there are REALLY 50 files.
2) I was attempting to create two-word tokens. From your answer, I think that I may be creating a large number of features this way.
I would love to get some help on clustering documents. If you are ever available to do any consulting, please let me know.
Thanks!!!
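A quick way to double-check for this kind of surprise (a sketch assuming macOS-style hidden entries such as .DS_Store; the path is the one from the process XML):

# Count what is actually inside the directory, including entries Finder hides.
import os

directory = "/Users/noah/Desktop/test_files"
entries = os.listdir(directory)
hidden = [e for e in entries if e.startswith(".")]
subdirs = [e for e in entries if os.path.isdir(os.path.join(directory, e))]
print(len(entries), "entries,", len(hidden), "hidden:", hidden, "subdirectories:", subdirs)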
ad 1) good to hear.
ad 2) Yes. Let's say your texts have a length of 1,000 words on average and are quite different. Then, with two-word n-grams, you can end up with up to 1,000 × 1,000 = 1,000,000 attributes. Each attribute carries about 1 KB of meta data, summing up to about 1 GB, plus the data size, plus... It is almost always worse to have huge numbers of attributes than huge numbers of examples.
Hence, using word n-grams is only applicable for short texts with similar words. Something similar holds for character n-grams, but from my experience the latter only help for short texts anyway. I actually do consulting (never noticed the company "Rapid-I" behind RapidMiner?). Please check out our web site at http://rapid-i.com or contact us for an offer.
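The same back-of-the-envelope estimate written out, with the numbers taken straight from the paragraph above:

# ~1,000 distinct words per text, two-word combinations, ~1 KB of meta data per attribute
words_per_text = 1000
attributes = words_per_text ** 2              # up to 1,000,000 word-pair attributes
meta_bytes_per_attribute = 1024               # rough per-attribute overhead
print(attributes, "attributes ~", attributes * meta_bytes_per_attribute / 2**30, "GB of meta data alone")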
Cheers,
Ingo
1) That makes perfect sense. Don't know why I didn't see this before. Without the n-gram step, the process finished MUCH faster.
2) Now I need to figure out the best way to cluster the documents. I'm trying to find some function that will decide on an ideal number of clusters based on the "similarity" of the documents. (If I were instructing a person, I would tell them, "Group these documents into piles that make sense. Put documents with similar topics or ideas together.") Do you know of a function that can do some intelligent grouping based on similarity and find the common "themes"? (One possible approach is sketched below.)
3) I did see that Rapid-I offers some courses, but they are far from me, so I am unable to attend. I might just want to hire an hour or two of phone time with someone. Is that possible?
Thanks!!!
-N
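Regarding item 2 above, one common recipe, sketched with scikit-learn rather than RapidMiner operators; the path and parameter values are placeholders, and the silhouette score is only one of several ways to pick a cluster count.

# TF-IDF vectors + k-means, trying several k and keeping the best silhouette score.
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [open(p, errors="ignore").read() for p in glob.glob("/Users/noah/Desktop/test_files/*")]
X = TfidfVectorizer(stop_words="english", min_df=2).fit_transform(docs)

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print("best k =", best_k, "(silhouette =", round(best_score, 2), ")")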