"Regarding text mining"
Hi,
Which text mining operator can we use to extract combinations or patterns of words in RapidMiner (RM)?
I have used the string tokenizer, the stop word filter and the token length filter, and computed TF-IDF, term frequency, etc.
Can anybody suggest a specific algorithm for solving this problem?
Thanks
Ratheesan
Answers
You could use BinaryOccurrences instead of TFIDF and then convert the numerical 0s and 1s to binominal values in order to apply FP-Growth. You will get FrequentItemSets containing the words that occur together in documents. With the support threshold you can control how frequently they have to occur together.
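To illustrate the idea outside of RapidMiner, here is a minimal Python sketch. It is not the FP-Growth operator and not FP-Growth itself; it simply counts co-occurring word pairs by brute force. The sample documents and the `frequent_pairs` helper are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Illustrative documents (assumed data, not from the original process).
documents = [
    "text mining operator extracts word patterns",
    "text mining finds patterns of words",
    "the operator extracts frequent word patterns",
]

# Binary occurrences: each document becomes the set of distinct tokens it contains,
# so it only matters whether a word occurs, not how often.
occurrences = [set(doc.lower().split()) for doc in documents]

def frequent_pairs(occurrences, min_support):
    """Return word pairs whose share of documents is at least min_support."""
    counts = Counter()
    for words in occurrences:
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(words), 2))
    n = len(occurrences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(occurrences, min_support=0.6)
for pair, support in sorted(pairs.items()):
    print(pair, round(support, 2))
```

Raising `min_support` keeps only pairs that occur together in more documents, which is the same lever the support threshold gives you in FP-Growth.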
Greetings,
Sebastian
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="b" value="C:\Documents and Settings\ADMIN\Desktop\b"/>
        </list>
        <parameter key="vector_creation" value="BinaryOccurrences"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="Numerical2Binominal" class="Numerical2Binominal">
        <parameter key="min" value="2.0"/>
        <parameter key="max" value="30000.0"/>
    </operator>
    <operator name="FPGrowth" class="FPGrowth">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="min_number_of_itemsets" value="5"/>
    </operator>
</operator>
How can I overcome this problem?
Thanks
Ratheesan
If you put a breakpoint after the Numerical2Binominal operator, does the process reach it?
If yes, I guess the problem is the very memory-consuming FP-Growth operator. Its memory consumption depends heavily on the support level, so you might increase the minimum support in order to get things done. Of course you will then receive fewer rules, because only rules with a higher support will be included at all.
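A small Python sketch of why the support level matters so much. This is a brute-force count on toy data, not RapidMiner's FP-Growth implementation; the `docs` data and the `count_frequent_itemsets` helper are invented for illustration. The number of frequent itemsets that has to be kept shrinks quickly as the minimum support rises.

```python
from itertools import combinations

# Toy documents as sets of tokens (binary occurrences); purely illustrative data.
docs = [
    {"text", "mining", "operator"},
    {"text", "mining", "pattern"},
    {"text", "mining", "operator", "pattern"},
    {"word", "pattern"},
]

def count_frequent_itemsets(docs, min_support):
    """Count word sets that occur together in at least min_support (a fraction) of docs."""
    vocabulary = sorted(set().union(*docs))
    needed = min_support * len(docs)
    total = 0
    for size in range(1, len(vocabulary) + 1):
        found = sum(
            1
            for candidate in combinations(vocabulary, size)
            if sum(1 for d in docs if d.issuperset(candidate)) >= needed
        )
        if found == 0:
            break  # no frequent set of this size, so no larger one can be frequent
        total += found
    return total

counts = {s: count_frequent_itemsets(docs, s) for s in (0.25, 0.5, 0.75)}
for support, n in counts.items():
    print(f"min support {support:.2f}: {n} frequent itemsets")
```

With a real vocabulary of thousands of words the gap between a low and a high threshold is far larger than in this toy case.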
Please also take a look at the memory monitor to check that you have assigned RapidMiner enough main memory. It usually uses up to 80% of the RAM.
Greetings,
Sebastian
I applied a decision tree to text data, but I am not getting a proper result. I am attaching the process here. Can you suggest how to proceed with it? If my approach is not correct, could you please suggest an alternative?
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="mydata" value="C:\Documents and Settings\ADMIN\Desktop\summary"/>
        </list>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="claimant"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="DecisionTree" class="DecisionTree">
    </operator>
</operator>
Thanks
Ratheesan
I am also getting the same memory problem. I am using Windows with 3 GB of RAM. Is that sufficient to work with? Please advise.
Thanks,
Sudheendra
Text mining usually involves a great number of attributes. A decision tree might become very large if the data is difficult to split. You would probably gain much better classification performance if you used a linear SVM. If your goal is an understandable model, you will have to stick with the tree, but you should limit its maximal depth to avoid the out-of-memory problem. A deeper tree wouldn't help the user anyway, because a binary tree with depth 10 can have up to 2047 nodes and already loses a lot of its understandability.
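The node count can be checked with a few lines of Python. This is just the arithmetic for a full tree, assuming every inner node splits into the same number of branches (two by default); a real tree on nominal attributes may branch differently, and `max_tree_nodes` is a name made up for the sketch.

```python
def max_tree_nodes(depth: int, branching: int = 2) -> int:
    """Upper bound on nodes in a tree where every inner node has `branching` children.

    The root is level 0, so a tree of the given depth has depth + 1 levels.
    """
    return sum(branching ** level for level in range(depth + 1))

for depth in (3, 5, 10):
    print(f"depth {depth}: up to {max_tree_nodes(depth)} nodes")
```

For binary splits this is 2^(depth + 1) - 1, which gives 2047 for depth 10 and only 63 for depth 5.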
Greetings,
Sebastian