"Regarding text mining"

ratheesan · November 2009

Hi,

Which text mining operator can we use to extract combinations or patterns of words in RapidMiner (RM)?
I have used the string tokenizer, stopword filter and token length filter, and computed TF-IDF, term frequency, etc.
Can anybody suggest a specific algorithm for solving this problem?
Thanks
Ratheesan

land · November 2009

Hi,
you could use BinaryOccurrences instead of TFIDF and then convert the numerical 0's and 1's to binominal values in order to apply FP-Growth. You will get FrequentItemSets containing the words that occur together in documents. Using the support threshold, you can control how frequently they have to occur together.
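
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch. It is not the FP-Growth operator itself: it counts word sets by brute force, which is only feasible for tiny inputs, but it yields the same kind of result (word sets together with their support). The corpus, the function names and the `max_size` limit are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands for one document.
documents = [
    "text mining with rapidminer",
    "text mining finds word patterns",
    "rapidminer supports text mining",
    "decision tree on text data",
]

def binary_occurrences(doc):
    """Return the distinct words of a document (presence only, no counts)."""
    return frozenset(doc.lower().split())

def frequent_itemsets(docs, min_support, max_size=3):
    """Brute-force stand-in for FP-Growth (assumption: small input only).

    min_support is the fraction of documents that must contain
    every word of an itemset.
    """
    transactions = [binary_occurrences(d) for d in docs]
    counts = Counter()
    for words in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(words), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

result = frequent_itemsets(documents, min_support=0.75)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

With a minimum support of 0.75, only word sets present in at least three of the four toy documents survive, for example the pair ("mining", "text").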

Greetings,
Sebastian

ratheesan · November 2009

Thanks, Sebastian, for your valued help and advice. I set up the process as you described, but I am getting the error message "Process failed. StackOverflowError caught null". I am attaching the XML here.

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="b" value="C:\Documents and Settings\ADMIN\Desktop\b"/>
        </list>
        <parameter key="vector_creation" value="BinaryOccurrences"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="Numerical2Binominal" class="Numerical2Binominal">
        <parameter key="min" value="2.0"/>
        <parameter key="max" value="30000.0"/>
    </operator>
    <operator name="FPGrowth" class="FPGrowth">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="min_number_of_itemsets" value="5"/>
    </operator>
</operator>

How can I overcome this problem?

Thanks
Ratheesan

land · November 2009

Hi,
if you put a breakpoint after the Numerical2Binominal operator, does the process reach it?
If yes, I guess the problem is the very memory-hungry FP-Growth operator. Its memory consumption depends heavily on the support level, so you might have to increase the minimum support to get things done. Of course you will then receive fewer rules, because only rules with a higher support will be included at all.
Please also take a look at the memory monitor to check that you have assigned enough main memory to RapidMiner. It usually uses up to 80% of the RAM.
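
As a rough illustration of why the support level matters so much, the following Python sketch counts how many word sets reach a given minimum support on a toy data set. It uses brute-force enumeration rather than FP-Growth, and the transactions and the function name are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions: each set holds the words present in one document.
transactions = [
    {"claim", "injury", "report", "driver"},
    {"claim", "injury", "report"},
    {"claim", "injury", "vehicle"},
    {"claim", "report", "driver"},
    {"claim", "vehicle"},
]

def count_frequent_itemsets(transactions, min_support):
    """Count word sets contained in at least min_support of the documents.

    Brute force over all subsets of the vocabulary (assumption: the
    vocabulary is tiny; a real text corpus is far too large for this).
    """
    vocabulary = sorted(set().union(*transactions))
    n = len(transactions)
    total = 0
    for size in range(1, len(vocabulary) + 1):
        for itemset in combinations(vocabulary, size):
            hits = sum(1 for t in transactions if t.issuperset(itemset))
            if hits / n >= min_support:
                total += 1
    return total

counts = {s: count_frequent_itemsets(transactions, s) for s in (0.2, 0.4, 0.6, 0.8)}
for support, number in counts.items():
    print(f"min_support={support}: {number} frequent itemsets")
```

Even with five words, the number of itemsets that have to be kept shrinks quickly as the minimum support rises. With thousands of word attributes, the difference is much more dramatic.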

Greetings,
Sebastian

ratheesan · November 2009

Hi,
I applied a decision tree to text data, but I am not getting a proper result. I am attaching the process here. Can you suggest how to proceed with it? If my approach is not correct, could you please suggest an alternative?

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="mydata" value="C:\Documents and Settings\ADMIN\Desktop\summary"/>
        </list>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="claimant"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="DecisionTree" class="DecisionTree">
    </operator>
</operator>

Thanks
Ratheesan

sudheendra · November 2009

Hi Sebastian,

I am also getting the same memory problem. I am using Windows with 3 GB of RAM. Is that sufficient to work with? Please advise.

Thanks,
Sudheendra

land · November 2009

Hi,
Text mining usually involves a great number of attributes. A decision tree might become very large if the data is difficult to split. You would probably get much better classification performance with a linear SVM. If your goal is an understandable model, you will have to stick with the tree, but you should limit its maximal depth to avoid the out-of-memory problem. An unlimited tree wouldn't help the user anyway, because a full binary tree of depth 10 already has 2047 nodes and loses a lot of its understandability.
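
The 2047 figure follows from counting the nodes of a completely filled binary tree: 1 + 2 + 4 + ... + 2^depth = 2^(depth + 1) - 1. A short Python sketch, assuming binary splits only (trees with multiway splits on nominal attributes can grow even faster); the function name is just for illustration:

```python
def max_nodes(depth):
    """Upper bound on the node count of a binary tree with the given depth.

    Assumption: every split is binary and the root sits at depth 0.
    """
    return 2 ** (depth + 1) - 1

for depth in (3, 5, 10):
    print(f"depth {depth}: up to {max_nodes(depth)} nodes")
```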

Greetings,
Sebastian
