"convert document files to transaction dataset"

panida_s Member Posts: 1 Learner III
edited June 2019 in Help
I am new to text mining and RapidMiner. I want to prepare a dataset to build a model with my own algorithm. The dataset should contain one row per text document, and each row should consist of the words contained in that document, separated by commas. The words should also have passed through these preprocessing steps: tokenization, stop word removal, stemming, and n-gram generation.

Please help me

Thank you

Answers

  • Skirzynski Member Posts: 164 Maven
    Typically, text mining does not use a data structure in which the terms are stored as comma-separated strings. Instead, you create word vectors that have one attribute for every word. Every document becomes a row (vector), and the value of each attribute (word) depends on the vector creation method (usually you want to use TF-IDF).

    Here is an example process with two hard-coded documents (use "Process Documents from Files" to read from a set of files instead). Inside the "Process Documents" operator you will see a "Tokenize" operator and a "Filter Stopwords (English)" operator.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="This is a book on data mining"/>
            <parameter key="label_value" value="text1"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="120">
            <parameter key="text" value="This book describes data mining and text mining using RapidMiner"/>
            <parameter key="label_value" value="text2"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="30">
            <parameter key="keep_text" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    The resulting example set can be used to learn models just like any other numerical data set. In text mining it is common to use an SVM for classification, for example.
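
To make the difference between the two representations concrete, here is a minimal Python sketch (outside RapidMiner) that builds both the comma-separated rows asked for in the question and TF-IDF word vectors for the same two example documents. The stop word list, the function names, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; RapidMiner's own stop word list and TF-IDF normalization may differ, so the numbers will not necessarily match the process output.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not RapidMiner's built-in list).
STOPWORDS = {"this", "is", "a", "on", "and", "using"}


def preprocess(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tf_idf_vectors(docs):
    """Return (vocabulary, rows) with one TF-IDF row per document.

    Uses relative term frequency times log(N / document frequency),
    which is one common textbook variant of TF-IDF.
    """
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        rows.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, rows


docs = [
    "This is a book on data mining",
    "This book describes data mining and text mining using RapidMiner",
]

# Representation 1: one comma-separated "transaction" row per document.
transactions = [",".join(preprocess(d)) for d in docs]

# Representation 2: one numeric word vector per document.
vocab, rows = tf_idf_vectors(docs)

for line in transactions:
    print(line)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

With this weighting, a term that occurs in every document gets a weight of 0, so in a two-document example only the terms unique to one document carry any weight. Stemming and n-gram generation are left out of the sketch to keep it short.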