"Java Heap space ERROR"
Hello,
I am attempting to do some basic tokenization of text files. I will then attempt to cluster them.
Right now, I am testing with only 200 small text files. RM processes for a while and then gives me an out-of-memory error. I have given 1 GB of memory to RM.
I would eventually like to use RM to cluster batches of 1,000 or even 10,000 files, but am concerned that I cannot even do the basic tokenization of only 200.
Please let me know if you have any ideas or suggestions.
Thanks!!
---------------------
Below is the XML of my process
<process version="4.2">
  <operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
      <parameter key="create_text_visualizer" value="true"/>
      <parameter key="default_content_language" value="english"/>
      <list key="namespaces">
      </list>
      <parameter key="on_the_fly_pruning" value="0"/>
      <parameter key="prune_below" value="10%"/>
      <list key="texts">
        <parameter key="News_Articles" value="/Users/noah/Desktop/test_files"/>
      </list>
      <operator name="StringTokenizer" class="StringTokenizer">
      </operator>
      <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
      </operator>
      <operator name="TokenLengthFilter" class="TokenLengthFilter">
        <parameter key="min_chars" value="3"/>
      </operator>
      <operator name="TermNGramGenerator" class="TermNGramGenerator">
      </operator>
    </operator>
  </operator>
</process>
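For anyone more comfortable reading plain code than the operator chain, here is a rough plain-Python sketch of what the process above does; the tokenizer, stopword list, and bigram step are simplified stand-ins, not the Text plugin's actual implementations.

# Simplified stand-in for TextInput -> StringTokenizer -> EnglishStopwordFilter
# -> TokenLengthFilter(min_chars=3) -> TermNGramGenerator.
import os
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}  # tiny stand-in list

def tokenize(text):
    # StringTokenizer: split on non-letter characters
    return re.findall(r"[a-z]+", text.lower())

def process_file(path, min_chars=3):
    with open(path, errors="ignore") as f:
        tokens = tokenize(f.read())
    # EnglishStopwordFilter + TokenLengthFilter
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= min_chars]
    # TermNGramGenerator: add word bigrams on top of the single terms
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

if __name__ == "__main__":
    directory = "/Users/noah/Desktop/test_files"   # same directory as in the process XML
    vocabulary = set()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            vocabulary.update(process_file(path))
    print(len(vocabulary), "distinct terms/bigrams across the corpus")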
Answers
The process runs just fine on my machine with the sample newsgroup texts of the text miner plugin.
Without the data I can't really say where the error comes from, so I have to ask a few questions.
Could you describe the memory monitor behavior before the error? Or even post a picture?
Greetings,
Sebastian
Thanks for the reply.
I have been able to easily run the sample without any problems. The sample is only a few files.
My files are all plain text with no HTML or XML. They vary in length from 4 KB to 400 KB. The total size of all 50 test files is 15.8 MB.
I assigned 1025m to RM before starting. The memory monitor shows the memory growing and shrinking over time, but mostly growing.
I noticed that some of the steps are running "492" times. This seems odd since I only have 50 files. Is that a clue?
I am also getting a few strange warnings in the log.
By the way, we frequently work on text classification of thousands of texts without any problem; for short texts it's even hundreds of thousands of texts. Of course the parameter settings have a great influence on memory usage. For example, using n-grams or not pruning enough will blow up the number of dimensions a lot for large texts with many different words.
Cheers,
Ingo
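To make the pruning/n-gram point concrete, here is a small sketch assuming scikit-learn is available; the path pattern and pruning values are placeholders, not settings from the process above.

# Compare the feature count for plain terms, terms + bigrams, and
# terms + bigrams with terms pruned that occur in fewer than 10% of the texts.
import glob
from sklearn.feature_extraction.text import CountVectorizer

docs = [open(p, errors="ignore").read() for p in glob.glob("/Users/noah/Desktop/test_files/*")]

for ngram_range, min_df in [((1, 1), 1), ((1, 2), 1), ((1, 2), 0.1)]:
    features = CountVectorizer(ngram_range=ngram_range, min_df=min_df,
                               stop_words="english").fit_transform(docs).shape[1]
    print("ngram_range", ngram_range, "min_df", min_df, "->", features, "features")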
I think you're correct about two things:
1) There was a hidden directory. I have corrected the problem and now there are REALLY 50 files.
2) I was attempting to create two-word tokens. From your answer, I think that I may be creating a large number of features this way.
I would love to get some help on clustering documents. If you are ever available to do any consulting, please let me know.
Thanks!!!
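A quick way to double-check for this kind of surprise (a sketch assuming macOS-style hidden entries such as .DS_Store; the path is the one from the process XML):

# Count what is actually inside the directory, including entries Finder hides.
import os

directory = "/Users/noah/Desktop/test_files"
entries = os.listdir(directory)
hidden = [e for e in entries if e.startswith(".")]
subdirs = [e for e in entries if os.path.isdir(os.path.join(directory, e))]
print(len(entries), "entries,", len(hidden), "hidden:", hidden, "subdirectories:", subdirs)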
ad 1) good to hear.
ad 2) Yes. Let's say your texts have a length of 1,000 words on average and are quite different. Then, with two-word n-grams, you can end up with up to 1,000 × 1,000 = 1,000,000 attributes. Each attribute carries about 1 KB of meta data, summing up to about 1 GB, plus the data size, plus... It is almost always worse to have huge numbers of attributes than huge numbers of examples.
Hence, using word n-grams is only applicable for short texts with similar words. Something similar holds for character n-grams, but from my experience the latter only help for short texts anyway. I actually do consulting (never noticed the company "Rapid-I" behind RapidMiner?). Please check out our web site at http://rapid-i.com or contact us for an offer.
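The same back-of-the-envelope estimate written out, with the numbers taken straight from the paragraph above:

# ~1,000 distinct words per text, two-word combinations, ~1 KB of meta data per attribute
words_per_text = 1000
attributes = words_per_text ** 2              # up to 1,000,000 word-pair attributes
meta_bytes_per_attribute = 1024               # rough per-attribute overhead
print(attributes, "attributes ~", attributes * meta_bytes_per_attribute / 2**30, "GB of meta data alone")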
Cheers,
Ingo
1) That makes perfect sense. Don't know why I didn't see this before. Without the n-gram step, the process finished MUCH faster.
2) Now I need to figure out the best way to cluster the documents. I'm trying to find some function that will decide on an ideal number of clusters based on the "similarity" of the documents. (If I were instructing a person, I would tell them, "Group these documents into piles that make sense. Put documents with similar topics or ideas together.") Do you know of a function that can do some intelligent grouping based on similarity and find the common "themes"? (One possible approach is sketched below.)
3) I did see that Rapid-I offers some courses, but they are far from me, so I am unable to attend. I might just want to hire an hour or two of phone time with someone. Is that possible?
Thanks!!!
-N
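Regarding item 2 above, one common recipe, sketched with scikit-learn rather than RapidMiner operators; the path and parameter values are placeholders, and the silhouette score is only one of several ways to pick a cluster count.

# TF-IDF vectors + k-means, trying several k and keeping the best silhouette score.
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [open(p, errors="ignore").read() for p in glob.glob("/Users/noah/Desktop/test_files/*")]
X = TfidfVectorizer(stop_words="english", min_df=2).fit_transform(docs)

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print("best k =", best_k, "(silhouette =", round(best_score, 2), ")")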