
"Regarding text mining"

ratheesanratheesan Member Posts: 68 Maven
edited May 2019 in Help
Hi,

Which text mining operator can we use to extract combinations or patterns of words in RM?
I have used the string tokenizer, stopword filter, and token length filter, and computed TF-IDF, term frequency, etc.
Can anybody suggest a specific algorithm for solving this problem?
Thanks
Ratheesan

Answers

  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you could use BinaryOccurrences instead of TFIDF and then convert the numerical 0s and 1s to binominal values in order to apply FP-Growth. You will get FrequentItemSets containing the words that occur together in documents. With the support threshold you can control how frequently they have to occur together.

    Greetings,
      Sebastian
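
To illustrate the idea in the answer above outside of RapidMiner, here is a minimal Python sketch. It treats each document as a set of words (binary occurrence) and keeps the word sets whose support reaches a threshold. The toy documents and the function name `frequent_itemsets` are made up for illustration, and the brute-force counting is a stand-in for FP-Growth, not the algorithm itself.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy documents, already tokenized and stopword-filtered.
docs = [
    ["text", "mining", "operator"],
    ["text", "mining", "pattern"],
    ["decision", "tree", "text"],
    ["text", "mining", "tree"],
]

def frequent_itemsets(docs, min_support, max_size=2):
    """Return {itemset: support} for word sets that occur together in
    at least min_support (a fraction) of the documents."""
    # Binary occurrence: a word is either present in a document or not.
    transactions = [set(d) for d in docs]
    counts = Counter()
    for words in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[combo] += 1
    n = len(transactions)
    return {c: k / n for c, k in counts.items() if k / n >= min_support}

result = frequent_itemsets(docs, min_support=0.75)
```

With these four documents and a support threshold of 0.75, only `text`, `mining`, and the pair `mining` + `text` survive. Enumerating every combination only works for tiny inputs; FP-Growth exists precisely to avoid that.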
  • ratheesanratheesan Member Posts: 68 Maven
    Thanks, Sebastian, for your valued help and advice. I set up the process as you suggested, but I am getting the error message "Process failed. StackOverflowError caught null". I am attaching the XML here.

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="b" value="C:\Documents and Settings\ADMIN\Desktop\b"/>
            </list>
            <parameter key="vector_creation" value="BinaryOccurrences"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
        </operator>
        <operator name="Numerical2Binominal" class="Numerical2Binominal">
            <parameter key="min" value="2.0"/>
            <parameter key="max" value="30000.0"/>
        </operator>
        <operator name="FPGrowth" class="FPGrowth">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="min_number_of_itemsets" value="5"/>
        </operator>
    </operator>

    How can I overcome this problem?

    Thanks
    Ratheesan
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if you put a breakpoint after the Numerical2Binominal operator, does the process reach it?
    If yes, I guess the problem is the very memory-hungry FP-Growth operator. Its memory consumption depends heavily on the support level, so you might increase the minimum support in order to get things done. Of course you will receive fewer rules, because only rules with a higher support will be included at all.
    Please take a look at the memory monitor to check that you have assigned RapidMiner enough main memory. It usually uses up to 80% of the RAM.

    Greetings,
      Sebastian
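
The effect of the support level described above can be seen on a toy example. The following Python sketch counts, by brute force, how many item sets are frequent at several thresholds. The transactions and the helper `count_frequent` are hypothetical, and this is not how FP-Growth works internally; it only shows why a low threshold produces far more item sets to keep track of.

```python
from itertools import combinations

# Hypothetical binary-occurrence transactions (one word set per document).
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c"},
    {"b", "d"},
]

def count_frequent(transactions, min_support):
    """Count all item sets whose support (fraction of transactions
    containing every item of the set) is at least min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    total = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits / n >= min_support:
                total += 1
    return total

# Number of frequent item sets per support threshold.
sizes = {s: count_frequent(transactions, s) for s in (0.2, 0.4, 0.6, 0.8)}
```

On this data the count drops from 15 item sets at a support of 0.2 to 2 at a support of 0.8. With thousands of word attributes the same effect is much more dramatic.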
  • ratheesanratheesan Member Posts: 68 Maven
    Hi,
    I applied a decision tree to text data, but I am not getting a proper result. I am attaching the process here. Can you suggest how to proceed with it? If my approach is not correct, could you please suggest an alternative?

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="mydata" value="C:\Documents and Settings\ADMIN\Desktop\summary"/>
            </list>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
        </operator>
        <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
            <parameter key="name" value="claimant"/>
            <parameter key="target_role" value="label"/>
        </operator>
        <operator name="DecisionTree" class="DecisionTree">
        </operator>
    </operator>

    Thanks
    Ratheesan
  • sudheendrasudheendra Member Posts: 22 Maven
    Hi Sebastian,

    I am also getting the same memory problem. I am using Windows with 3 GB of RAM. Is that sufficient to work with? Please advise.

    Thanks,
    Sudheendra
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    Text mining usually involves a great number of attributes. A decision tree might become very large if the data is difficult to split. You would probably get much better classification performance with a linear SVM. If your goal is an understandable model, you will have to stick with the tree, but you should limit its maximal depth to avoid the out-of-memory problem. A very deep tree wouldn't help the user anyway: a full binary tree of depth 10 can have up to 2047 nodes and already loses a lot of its understandability :)

    Greetings,
      Sebastian
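
The node count mentioned above can be checked with a few lines of Python. This sketch assumes a full binary tree with the root at depth 0, so it gives an upper bound; a real decision tree is usually smaller, and nominal splits can have more than two branches. The function name `max_nodes` is made up for illustration.

```python
def max_nodes(depth, branching=2):
    """Maximum number of nodes in a full tree whose root is at depth 0
    and whose inner nodes each have `branching` children."""
    # Level k holds branching**k nodes; sum over levels 0..depth.
    return sum(branching ** level for level in range(depth + 1))

depth_10 = max_nodes(10)  # 2**11 - 1 for a binary tree
```

Limiting the depth to 5 instead caps a binary tree at `max_nodes(5)`, which is 63 nodes and much easier to read.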