
"text mining Excel file"

rdmckinney Member Posts: 15 Maven
edited May 2019 in Help
I didn't find my topic with a search, so please redirect me if this has been discussed elsewhere. I have an Excel file with comments from members, and I want to mine the comments as if each member/record is a document. I can get the Excel file into RapidMiner easily with ExcelExampleSource, but when I connect that to TextInput I get an error message: "Error in: TextInput (TextInput) The attribute 'text_source' does not exist. The example set does not contain an attribute with the given name." What should be my next step after the ExcelExampleSource?
Thanks!
Roger D. McKinney

Answers

  • TobiasMalbrecht Moderator, Employee-RapidMiner, Member Posts: 295 RM Product Management
    Hello Roger,

    Use the StringTextInput operator instead of the TextInput operator.

    Kind regards,
    Tobias
  • Legacy User Member Posts: 0 Newbie
    I tried the StringTextInput operator, but the ExcelExampleSource operator doesn't let me designate a field as string, and the StringTextInput operator looks for a field designated as string. In the end I saved the Excel file as a tab-delimited file and imported it with the ExampleSource operator followed by the StringTextInput operator, and that works fine.
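For anyone who wants to sanity-check the exported file outside RapidMiner, here is a minimal Python sketch of the same idea: read a tab-delimited export and treat each record's comment as one document. The column names `member_id` and `comment` and the sample rows are made up for illustration; they are not taken from the file discussed here.

```python
import csv
import io


def read_comments(handle, text_column):
    """Return one string per record from a tab-delimited file-like object."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[text_column] for row in reader]


# Hypothetical sample standing in for the sheet saved as tab-delimited text.
sample = io.StringIO(
    "member_id\tcomment\n"
    "1\tGreat service\n"
    "2\tToo slow to respond\n"
)
documents = read_comments(sample, "comment")
```

With a real export you would pass `open(path, newline="", encoding=...)` instead of the in-memory sample, choosing the encoding the file was actually saved in.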
  • rdmckinney Member Posts: 15 Maven
    PS, For every document/example in my file, I get this error message: "[Warning] StringTextInput: Warning: Encoding  unknown. Using default." Should I worry about this?
    Thanks!
  • TobiasMalbrecht Moderator, Employee-RapidMiner, Member Posts: 295 RM Product Management
    Hi,
    rdmckinney wrote:

    PS, For every document/example in my file, I get this error message: "[Warning] StringTextInput: Warning: Encoding  unknown. Using default." Should I worry about this?
    Thanks!
    Well, if the data shows up correctly, you do not need to worry! ;)

    Btw.: the Nominal2String operator converts nominal columns to string columns. That way, you could load the texts directly from the Excel file.

    Kind regards,
    Tobias
  • rdmckinney Member Posts: 15 Maven
    Thanks for the tip!

    I'm making progress  :o I am running the following code and so far it has taken 43 minutes. Is that normal?

    <operator name="Root" class="Process" expanded="yes">
        <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information  for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Documents and Settings\rkenney\My Documents\RapidMiner\TextExamples\MSSComments.aml"/>
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="remove_original_attributes" value="true"/>
            <parameter key="default_content_language" value="english"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars" value="3"/>
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
        </operator>
        <operator name="EMClustering" class="EMClustering">
            <parameter key="k" value="5"/>
        </operator>
    </operator>
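As a rough illustration of what the nested operators in this process do to each comment, here is a small Python sketch of the same chain: tokenize, drop stopwords, drop tokens shorter than `min_chars`, then stem. It is not RapidMiner's implementation. The stopword list is a tiny made-up one, `crude_stem` is a simple suffix stripper standing in for the Porter stemmer, and the output is a plain count table, which may differ from the weighting StringTextInput produces.

```python
import re

# Tiny illustrative stopword list; a real filter uses a much longer one.
STOPWORDS = {"the", "and", "was", "for", "with", "this", "that", "are"}


def crude_stem(token):
    """Strip a few common suffixes. NOT the Porter algorithm, only a stand-in."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_chars=3):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword filter
    tokens = [t for t in tokens if len(t) >= min_chars]   # token length filter
    return [crude_stem(t) for t in tokens]                # stemming stand-in


def term_counts(documents):
    """Build a table with one row per document and one column per term."""
    processed = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in processed for t in doc})
    rows = [[doc.count(term) for term in vocabulary] for doc in processed]
    return vocabulary, rows


vocabulary, rows = term_counts(["The staff was helping us", "Helped with billing"])
```

Each distinct stem becomes one attribute, which is why a few hundred comments can easily turn into well over a thousand columns.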
  • rdmckinney Member Posts: 15 Maven
    Sorry, I forgot to tell you that the input to the EM clustering operator has 600 examples and about 1,600 attributes. It's now up to 1 hr 9 minutes.
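A back-of-the-envelope check on those sizes, under one assumption that the thread does not confirm: that the EM operator fits a Gaussian mixture with a full covariance matrix per cluster. If so, the number of free parameters dwarfs the amount of data, which would be consistent with a very long run. The function name below is made up for this sketch.

```python
def full_covariance_gmm_parameters(k, d):
    """Free parameters of a k-component Gaussian mixture with full covariances."""
    means = k * d
    covariances = k * d * (d + 1) // 2  # one symmetric d x d matrix per cluster
    weights = k - 1                     # mixing weights sum to one
    return means + covariances + weights


n_examples, n_attributes, k = 600, 1600, 5
parameters = full_covariance_gmm_parameters(k, n_attributes)
data_values = n_examples * n_attributes
```

Here `parameters` comes out at 6,412,004 against 960,000 data values, so under this assumption the model has several times more unknowns than observations.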
  • rdmckinney Member Posts: 15 Maven
    I stopped the EM clustering after about 2 hrs and substituted K-means. It ran in about 20 seconds, which is when I realized that it clusters examples and not attributes. I need to cluster attributes because each attribute is a word from a text mining problem. Is there an operator that will transpose a data set?
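To make the transpose question concrete, here is what the operation means on a small table, sketched in plain Python rather than as a RapidMiner operator: rows (examples) become columns and columns (attributes) become rows. The numbers are made up.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table given as a list of lists."""
    return [list(column) for column in zip(*rows)]


# Two examples (documents) with three attributes (words) each.
examples = [[0, 1, 1], [1, 1, 0]]
by_attribute = transpose(examples)  # three rows, one per word
```

A clustering routine that groups rows would group words when given `by_attribute` and documents when given `examples`.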
  • rdmckinney Member Posts: 15 Maven
    Never mind! I realized my mistake. I need to apply dimension reduction, such as principal components to the attributes, then cluster. Sorry!
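The "reduce, then cluster" idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it keeps a single principal direction (found by power iteration, not by RapidMiner's PCA or GHA operators), clusters the resulting one-dimensional scores with a basic k-means, and uses made-up data. The start vector must not be orthogonal to the leading direction, which holds for the sample below.

```python
import math


def center(rows):
    """Subtract each column's mean so every attribute has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component(rows, iterations=100):
    """Power iteration for the leading principal direction of centered rows."""
    d = len(rows[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        new_v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:
            break  # start vector was orthogonal to the data; keep the old v
        v = [w / norm for w in new_v]
    return v


def kmeans_1d(values, k, iterations=20):
    """Basic k-means on scalars with a deterministic, evenly spread start."""
    ordered = sorted(values)
    step = max(k - 1, 1)
    centroids = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


rows = center([[0, 0], [1, 1], [0, 1], [10, 10], [11, 11], [10, 11]])
component = first_component(rows)
scores = [sum(x * w for x, w in zip(row, component)) for row in rows]
labels = kmeans_1d(scores, k=2)
```

With real text data you would keep more than one component and cluster the examples in that reduced space; the single-component version just keeps the sketch short.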
  • rdmckinney Member Posts: 15 Maven
    Is there any way to re-run just one operator in a chain? The reason I ask is that I have a model that imports data from Excel, uses the stemmers, tokenizers and stopword filters inside StringTextInput to create a data set, and then applies the GHA operator. Do I always need to run through all operators each time, when all I really want to do is re-run the GHA operator with different settings? Thanks!
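The thread does not record an answer to this, but the general pattern is to store the output of the slow steps once and reload it for later runs. Here is a minimal Python sketch of that pattern; it does not show a RapidMiner feature, and `cached`, `expensive_preprocessing` and the file name are hypothetical names chosen for the example.

```python
import os
import pickle
import tempfile


def cached(path, compute):
    """Load a pickled result from path if present, otherwise compute and store it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_preprocessing():
    calls.append(1)  # record that the slow step actually ran
    return [[0, 1, 1], [1, 1, 0]]


cache_path = os.path.join(tempfile.mkdtemp(), "term_table.pkl")
first = cached(cache_path, expensive_preprocessing)
second = cached(cache_path, expensive_preprocessing)  # served from disk
```

The second call never touches the preprocessing step, so only the downstream step (here, the dimension reduction) has to be repeated when its settings change.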
  • rdmckinney Member Posts: 15 Maven
    I have a more serious issue now. I'm getting these messages:

    May 1, 2009 11:50:55 AM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of GHA (GHA)
    May 1, 2009 11:50:55 AM: [Fatal] Process failed: operator cannot be executed (6). Check the log messages...

    Here's my code:

    <operator name="Root" class="Process" expanded="yes">
        <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information  for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Documents and Settings\rkenney\My Documents\RapidMiner\TextExamples\MSSComments.aml"/>
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="remove_original_attributes" value="true"/>
            <parameter key="default_content_language" value="english"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars" value="3"/>
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
        </operator>
        <operator name="GHA" class="GHA">
            <parameter key="number_of_components" value="6"/>
            <parameter key="number_of_iterations" value="100"/>
        </operator>
    </operator>
  • rdmckinney Member Posts: 15 Maven
    I need to add that in the problem above I am trying to reduce 1,600 variables to as few components as possible. If I choose -1 as my number of components, then the program works fine and creates 1,600 components. But if I try to limit the number of components to even 200, I get the error message.
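For readers unfamiliar with GHA, here is a plain-Python sketch of what the Generalized Hebbian Algorithm (Sanger's rule) computes: it learns the leading principal directions one sample at a time. This is not RapidMiner's code and does not diagnose the exception above. The initial weights, learning rate and sample data are assumptions made for the example; only the parameter names echo `number_of_components` and `number_of_iterations` from the process.

```python
import math


def gha(rows, n_components, n_iterations=100, learning_rate=0.01):
    """Estimate leading principal directions of centered rows with Sanger's rule."""
    d = len(rows[0])
    # Small deterministic starting weights (an arbitrary choice for this sketch).
    weights = [[0.1 / (1 + abs(i - j)) for j in range(d)] for i in range(n_components)]
    for _ in range(n_iterations):
        for x in rows:
            # Output of each component for this sample.
            y = [sum(w * v for w, v in zip(weights[i], x)) for i in range(n_components)]
            new_weights = []
            residual = list(x)
            for i in range(n_components):
                # Remove what components 0..i already reconstruct from the sample.
                residual = [r - y[i] * w for r, w in zip(residual, weights[i])]
                new_weights.append(
                    [w + learning_rate * y[i] * r for w, r in zip(weights[i], residual)]
                )
            weights = new_weights
    return weights


# Centered sample data lying along the direction (1, 1).
rows = [[1, 1], [2, 2], [3, 3], [-1, -1], [-2, -2], [-3, -3]]
w = gha(rows, n_components=1)[0]
norm = math.sqrt(sum(v * v for v in w))
cosine = abs(w[0] + w[1]) / (math.sqrt(2) * norm)
```

On this sample the learned vector ends up with unit length and points along (1, 1). Note that the weight table has one row per requested component and one column per attribute, so a request for 200 components on 1,600 attributes is a 200 x 1,600 table.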