
"text mining visualization"

emolano Member Posts: 13 Contributor II
edited May 2019 in Help
Hi all,
Help for a new user! I'm doing some text mining and want to visualize the word frequencies. How can I do this?
Something like a tag cloud/word cloud would be nice.
This is what I have so far...
<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#CRM Data Mining#ylt#/h3#ygt##ylt#p#ygt#.#ylt#/p#ygt#"/>
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url" value="jdbc:mysql://test:3306/test"/>
        <parameter key="username" value="test"/>
        <parameter key="password" value="C2jgjgjh4JiellkjDOm4="/>
        <parameter key="query" value="SELECT `ID_NUM`, `SHORT_DESC`, `PLATFORM` FROM `PROBLEM` WHERE platform is not null;"/>
        <parameter key="label_attribute" value="PLATFORM"/>
        <parameter key="id_attribute" value="ID_NUM"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes" value="true"/>
        <parameter key="remove_original_attributes" value="true"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <parameter key="output_word_list" value="C:\Documents and Settings\emolano\My Documents\rm_workspace\output"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="2"/>
        </operator>
        <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
        <operator name="TermNGramGenerator" class="TermNGramGenerator">
        </operator>
    </operator>
</operator>


... I get the word frequencies but don't know how to visualize them...
Thanks
e
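
For readers who land here with the same question: the core of a tag cloud is just a mapping from term counts to font sizes. Below is a minimal sketch of that mapping in Python, outside RapidMiner. The function name `tag_cloud_sizes`, the point-size range, and the sample tokens are all assumptions made for illustration; they are not taken from the process above or from any RapidMiner API.

```python
from collections import Counter
import math

def tag_cloud_sizes(tokens, min_pt=10, max_pt=48, top_n=50):
    """Map each of the top_n most frequent tokens to a font size in points."""
    counts = Counter(tokens).most_common(top_n)
    if not counts:
        return {}
    # Log scaling keeps a few very frequent terms from dwarfing the rest.
    logs = {term: math.log(n) for term, n in counts}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all counts are equal
    return {
        term: round(min_pt + (value - lo) / span * (max_pt - min_pt))
        for term, value in logs.items()
    }

# Hypothetical, already tokenized terms standing in for the SHORT_DESC column.
tokens = ["crash"] * 8 + ["login"] * 4 + ["printer"] * 2 + ["timeout"]
sizes = tag_cloud_sizes(tokens)
for term, size in sizes.items():
    print(f"{term:10s} {size}pt")
```

The resulting sizes could then be handed to whatever renders the cloud (HTML, a plotting library, and so on). Linear scaling would also work; the log scale is only a design choice that tends to read better when a handful of terms dominate.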

Answers

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    I would suggest a parallel plot, at least if you have fewer than a few thousand terms. Alternatively, you could also use the CorpusBasedWeighting operator for each class and visualize the different weight vectors.

    Cheers,
    Ingo
  • derchief Member Posts: 5 Contributor II
    Hi Ingo,

    You said "the CorpusBasedWeighting for each class". How can I define such a class? In my case, the weights are all 0 or 1, which does not seem to give usable results.

    I have two further related questions:

    1) In my setting, I am loading some text files and get a list of words with values like "avg = 0.029 +/- 0.167". I don't understand exactly what this means. Can I use this information to group the words by their occurrence in the source files?

    2) Most importantly, I would like to separate my texts into groups and visualize their analyses to compare them. As a tiny example, one group could be female texts and another male texts, and I would like to compare their usage of words or word combinations (like: these are typical female phrases: ...). Is there a way to tell RapidMiner which text belongs to which group and to take this information into account?
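
On question 1: a value such as "avg = 0.029 +/- 0.167" can plausibly be read as the mean of a term's occurrence count over all documents, plus or minus its standard deviation. That reading is an assumption; nothing in this thread confirms it, nor whether a population or a sample deviation is used. The sketch below only shows that the reading is consistent with the numbers, using a hypothetical corpus of 35 documents in which the term occurs once in a single document.

```python
from statistics import mean, pstdev

# Hypothetical corpus: 35 documents, the term occurs once in exactly one of them.
occurrences = [1] + [0] * 34

avg = mean(occurrences)
spread = pstdev(occurrences)  # population standard deviation (an assumption)

print(f"avg = {avg:.3f} +/- {spread:.3f}")  # prints "avg = 0.029 +/- 0.167"
```

Under that reading, a small average with a deviation several times larger points to a term concentrated in very few files, which is one way the statistic could be used to group words by how widely they occur.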


    Cheers,
    Chris


    Setting:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="NoteLens" value="C:\Dokumente und Einstellungen\cniemann\Eigene Dateien\NoteLens Documents\store"/>
            </list>
            <parameter key="default_content_type" value="txt"/>
            <parameter key="default_content_encoding" value="UTF-8"/>
            <parameter key="default_content_language" value="german"/>
            <parameter key="vector_creation" value="TermOccurrences"/>
            <parameter key="id_attribute_type" value="short"/>
            <list key="namespaces">
            </list>
            <parameter key="create_text_visualizer" value="true"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
        </operator>
        <operator name="CorpusBasedWeighting" class="CorpusBasedWeighting">
            <parameter key="normalize_weights" value="false"/>
            <parameter key="class_to_characterize" value="3"/>
        </operator>
    </operator>
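
On question 2 (comparing groups of texts): the thread does not show how this is configured in RapidMiner, so the following is only a sketch of the underlying idea in plain Python. Each text carries a group label, term frequencies are computed per group, and terms are ranked by how much more frequent they are in one group than in the other. The function names, the group labels, and the sample tokens are hypothetical.

```python
from collections import Counter

def relative_frequencies(docs):
    """Return term -> share of all tokens, over a list of tokenized documents."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def characteristic_terms(group, other, top_n=3):
    """Rank terms by how much more frequent they are in `group` than in `other`."""
    f_group = relative_frequencies(group)
    f_other = relative_frequencies(other)
    diff = {t: f - f_other.get(t, 0.0) for t, f in f_group.items()}
    return sorted(diff, key=diff.get, reverse=True)[:top_n]

# Hypothetical labelled corpus: group label -> list of tokenized texts.
corpus = {
    "group_a": [["lovely", "day", "lovely", "garden"], ["lovely", "tea"]],
    "group_b": [["football", "day", "beer"], ["football", "match"]],
}
print(characteristic_terms(corpus["group_a"], corpus["group_b"]))
print(characteristic_terms(corpus["group_b"], corpus["group_a"]))
```

A plain frequency difference is the simplest possible score; it is not claimed to be what CorpusBasedWeighting computes.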
