
"creating a Word List"

siawling Member Posts: 4 Contributor I
edited June 2019 in Help
hi,

I would like to create a word list for a list of documents.

I read in the word vector tool tutorial that the following chain of operators can help: TextInput, CorpusBasedWeighting and InteractiveAttributeWeighting.

I tried it, but I am at a loss as to what to fill in for the parameter class_to_characterize of CorpusBasedWeighting. I have no class label specified, as there is no class involved. The input has the document name as ID and the content as an attribute.

I would appreciate any advice and guidance.

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you cannot use the weighting if you don't have labels. That's because the weighting expresses how important each word is for distinguishing documents with different labels. If you don't have labels, there's nothing to distinguish and hence no weighting...

    What do you need the word list for? Perhaps you can simply generate the standard word list automatically by processing the documents with something like this:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="306" width="748">
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="51" y="86">
            <parameter key="text" value="Hi,&#10;you cannot use the weighting if you don't have labels. That's because the weighting expresses the importance of words for distinguishing documents of the different labels. If you don't have labels, there's nothing to distinguish and hence no weighting...&#10;&#10;What do you need the word list for? Perhaps you can simply generate the standard word list automatically by processing the documents using something like that:&#10;"/>
          </operator>
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document (2)" width="90" x="51" y="236">
            <parameter key="text" value="hi,&#10;&#10;I would like to create a word list for a list of documents.&#10;&#10;I read fromthe word vector tool tutorial that the following chain of operators can help : TextInput, CorpusBasedWeighting and InteractiveAttributeWeighting.&#10;&#10;I tried it but at lost at what to fill in for parameter class_to_characterize for CorpusBasedWeighting. I have no class label specified as there is no class involved. I have document name as ID and the content as attribute for the input.&#10;&#10;Appreciate any advice and guidance."/>
          </operator>
          <operator activated="true" class="text:documents_to_data" expanded="true" height="94" name="Example Data" width="90" x="246" y="165">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="514" y="165">
            <list key="specify_weights"/>
            <process expanded="true" height="444" width="828">
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Example Data" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Example Data" to_port="documents 2"/>
          <connect from_op="Example Data" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
    Sebastian
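
For readers who want to see what such a word list contains without opening RapidMiner, here is a minimal Python sketch of the same idea. It is not the RapidMiner implementation: the function name `build_word_list`, the sample documents, and the choice to split on every non-letter character (assumed here to approximate the default behaviour of the Tokenize operator) are illustrative assumptions. No labels are needed, because the list only counts occurrences.

```python
import re
from collections import Counter


def build_word_list(documents):
    """Map each word to (total occurrences, number of documents containing it).

    `documents` is a dict of document ID -> text. The name and the return
    shape are hypothetical, chosen only for this illustration.
    """
    total = Counter()       # occurrences of each word across all documents
    doc_counts = Counter()  # number of documents in which each word appears
    for text in documents.values():
        # Assumption: split on runs of non-letter characters; case is kept as-is.
        tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
        total.update(tokens)
        doc_counts.update(set(tokens))
    return {word: (total[word], doc_counts[word]) for word in sorted(total)}


# Two small sample documents, keyed by document ID (made up for this sketch).
docs = {
    "doc1": "I would like to create a word list for a list of documents.",
    "doc2": "Perhaps you can simply generate the standard word list.",
}
word_list = build_word_list(docs)
for word, (occurrences, in_documents) in word_list.items():
    print(word, occurrences, in_documents)
```

Each printed row holds a word, its total number of occurrences, and the number of documents it appears in. Any weighting that depends on class labels would have to be a separate step on top of such a list.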