Why is Stem (Dictionary) not working?
roger_rutishaus
Member Posts: 8 Contributor II
Hi,
I use "Stem (Dictionary)", to which I connected an "Open File" operator that loads a .txt file.
The .txt file contains entries like:
jugendlich:jugendlich jugendliche jugendlichem jugendlichen jugendlicher jugendliches
jugendpflegerisch:jugendpflegerisch jugendpflegerische jugendpflegerischem jugendpflegerischen jugendpflegerischer jugendpflegerisches
jugoslawisch:jugoslawisch jugoslawische jugoslawischem jugoslawischen jugoslawischer jugoslawisches
jung:jung junge jungem jungen
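(For reference, the mapping such a dictionary file describes can be sketched outside RapidMiner as a simple token-replacement in Python. This is an illustrative sketch of the intended behavior, not RapidMiner's actual implementation; the function names are made up.)

```python
def load_stem_dictionary(path):
    """Parse lines of the form 'stem:form1 form2 ...' into a form -> stem map."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or ":" not in line:
                continue  # skip blank or malformed lines
            stem, forms = line.split(":", 1)
            for form in forms.split():
                mapping[form] = stem
    return mapping

def stem_tokens(tokens, mapping):
    """Replace every token found in the dictionary with its stem;
    leave unknown tokens unchanged."""
    return [mapping.get(tok, tok) for tok in tokens]
```

With the entries above loaded, `stem_tokens(["die", "jugendlichen"], mapping)` should yield `["die", "jugendlich"]`.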
The stemmer does not work: the resulting word list still contains "jugendlichen" instead of "jugendlich".
What am I doing wrong? Thanks for your help!
Roger
Complete settings:
<div class="Spoiler"><pre class="CodeBlock"><?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (2)" width="90" x="45" y="34">
<parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Web"/>
<parameter key="recursive" value="true"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34">
<parameter key="extract_text_only" value="false"/>
<parameter key="content_type" value="html"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<connect from_port="file object" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">HTML files</description>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.0.003" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="text:html_to_xml" compatibility="8.1.000" expanded="true" height="68" name="HTML to XML" width="90" x="45" y="34"/>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="body" value="<body.*[\s\S]+</body>"/>
</list>
<list key="regular_region_queries">
<parameter key="body" value="<body\.*>.<\\/body>"/>
</list>
<list key="xpath_queries">
<parameter key="inhalt_html-dokumente" value="//h:div[@id="content_center"]//h:div[@class="conttext"][text()]"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="112" y="34">
<parameter key="minimum_text_block_length" value="6"/>
</operator>
<operator activated="true" class="text:filter_documents_by_content" compatibility="8.1.000" expanded="true" height="82" name="Filter Documents (by Content)" width="90" x="246" y="34">
<parameter key="condition" value="contains match"/>
<parameter key="regular_expression" value="."/>
</operator>
<connect from_port="segment" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Filter Documents (by Content)" to_port="documents 1"/>
<connect from_op="Filter Documents (by Content)" from_port="documents" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="single" to_op="HTML to XML" to_port="document"/>
<connect from_op="HTML to XML" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Keep only relevant text</description>
</operator>
<operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (3)" width="90" x="45" y="187">
<parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Projekt"/>
<parameter key="recursive" value="true"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document (2)" width="90" x="112" y="34">
<parameter key="encoding" value="UTF-8"/>
</operator>
<connect from_port="file object" to_op="Read Document (2)" to_port="file"/>
<connect from_op="Read Document (2)" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">TXT files</description>
</operator>
<operator activated="true" class="collect" compatibility="9.0.003" expanded="true" height="103" name="Collect (2)" width="90" x="313" y="136">
<description align="center" color="transparent" colored="false" width="126">Collect source documents</description>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents (2)" width="90" x="447" y="136">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="99999"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34">
<parameter key="mode" value="regular expression"/>
<parameter key="characters" value=" "/>
<parameter key="expression" value="((-[^a-zA-Z])+)|(([^a-zA-Z]{1,}-)+)|([^a-zA-Zäöü0-9-]+)"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="313" y="34">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="100"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="179" y="136">
<parameter key="file" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\stopwords-de-solariz-small.txt"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="313" y="136"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="136"/>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="313" y="238">
<parameter key="condition" value="contains match"/>
<parameter key="string" value="^[0-9]"/>
<parameter key="regular_expression" value="^[^0-9].*"/>
</operator>
<operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="238">
<parameter key="max_length" value="3"/>
</operator>
<operator activated="false" class="text:filter_tokens_by_pos" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="514" y="340">
<parameter key="language" value="German"/>
<parameter key="expression" value="NE"/>
<parameter key="invert_filter" value="true"/>
</operator>
<operator activated="false" class="text:stem_german" compatibility="8.1.000" expanded="true" height="68" name="Stem (German)" width="90" x="447" y="493"/>
<operator activated="true" class="open_file" compatibility="9.0.003" expanded="true" height="68" name="Open File" width="90" x="112" y="544">
<parameter key="filename" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\rogerwordlist3.txt"/>
</operator>
<operator activated="true" class="text:stem_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Stem (Dictionary)" width="90" x="246" y="442"/>
<operator activated="true" class="text:extract_token_number" compatibility="8.1.000" expanded="true" height="68" name="Extract Token Number" width="90" x="648" y="34"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Stem (Dictionary)" to_port="document"/>
<connect from_op="Open File" from_port="file" to_op="Stem (Dictionary)" to_port="file"/>
<connect from_op="Stem (Dictionary)" from_port="document" to_op="Extract Token Number" to_port="document"/>
<connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Process documents</description>
</operator>
<operator activated="true" class="write_excel" compatibility="9.0.003" expanded="true" height="82" name="Write Excel (2)" width="90" x="514" y="34">
<parameter key="excel_file" value="D:\Dropbox\_BT\Textanalyse\terms-multimediaprod.xlsx"/>
<parameter key="number_format" value="#.000"/>
</operator>
<operator activated="false" class="text:process_documents" compatibility="8.1.000" expanded="true" height="82" name="Process Documents" width="90" x="246" y="595">
<process expanded="true">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Loop Files (2)" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Collect (2)" to_port="input 1"/>
<connect from_op="Loop Files (3)" from_port="output 1" to_op="Collect (2)" to_port="input 2"/>
<connect from_op="Collect (2)" from_port="collection" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_op="Write Excel (2)" to_port="input"/>
<connect from_op="Process Documents (2)" from_port="word list" to_port="result 2"/>
<connect from_op="Write Excel (2)" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process></pre></div>
Comments
But in the meantime, if you want a workaround, you can try the Stem Tokens Using Exampleset operator, which allows you to put your desired stemming into a normal dataset. This operator is part of the free Operator Toolbox extension.
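(For illustration, a dataset for such an operator could be derived from the existing dictionary file by expanding each line into one row per inflected form. This is a hedged sketch; the column names "word" and "stem" are assumptions, so check the operator's help for the names it actually expects.)

```python
import csv

def dictionary_to_exampleset_csv(dict_path, csv_path):
    """Convert 'stem:form1 form2 ...' lines into a two-column CSV
    (one row per inflected form) that can be read into an example set."""
    with open(dict_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["word", "stem"])  # assumed column names
        for line in src:
            line = line.strip()
            if not line or ":" not in line:
                continue  # skip blank or malformed lines
            stem, forms = line.split(":", 1)
            for form in forms.split():
                writer.writerow([form, stem])
```

The resulting CSV can then be loaded with Read CSV and wired into the operator's example set input.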
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Scott
@Telcontar120
I don't know which operator you mean (I can't find an operator named "Stem Tokens").
@sgenzer
The stemming file is attached.
The new, simplified process is as follows:
To elaborate on @Telcontar120's suggestion, you have to:
- Go to the Marketplace and install the Operator Toolbox extension.
- Then follow the instructions in this screenshot:
I hope it helps,
Regards,
Lionel
Thank you, now I have the "Operator Toolbox way" working.
As far as I can see, it can be used to create custom stemming rules, but it doesn't look as if it can be used for dictionary-based stemming, right?
@sgenzer have you had time to look at the issue yet?
Thanks again to everyone involved for your time!
Regards, Roger
Scott
I don't think Operator Toolbox is the way to go, as I can't find a way to do dictionary-based stemming with it (only rule-based stemming).
So I am looking forward to a solution with the "Stem (Dictionary)" operator :-)
Roger