Why is Stem (Dictionary) not working?
roger_rutishaus
Member Posts: 8 Contributor II
Hi,
I use "Stem (Dictionary)", to which I connected an "Open File" operator that loads a .txt file.
The .txt file contains entries like:
jugendlich:jugendlich jugendliche jugendlichem jugendlichen jugendlicher jugendliches
jugendpflegerisch:jugendpflegerisch jugendpflegerische jugendpflegerischem jugendpflegerischen jugendpflegerischer jugendpflegerisches
jugoslawisch:jugoslawisch jugoslawische jugoslawischem jugoslawischen jugoslawischer jugoslawisches
jung:jung junge jungem jungen
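(For reference, the mapping such a dictionary file describes can be sketched outside RapidMiner as a simple token-replacement in Python. This is an illustrative sketch of the intended behavior, not RapidMiner's actual implementation; the function names are made up.)

```python
def load_stem_dictionary(path):
    """Parse lines of the form 'stem:form1 form2 ...' into a form -> stem map."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or ":" not in line:
                continue  # skip blank or malformed lines
            stem, forms = line.split(":", 1)
            for form in forms.split():
                mapping[form] = stem
    return mapping

def stem_tokens(tokens, mapping):
    """Replace every token found in the dictionary with its stem;
    leave unknown tokens unchanged."""
    return [mapping.get(tok, tok) for tok in tokens]
```

With the entries above loaded, `stem_tokens(["die", "jugendlichen"], mapping)` should yield `["die", "jugendlich"]`.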
The stemmer does not work: the resulting word list still contains "jugendlichen" instead of "jugendlich".
What am I doing wrong? Thanks for your help!
Roger
Complete settings:
<div class="Spoiler"><pre class="CodeBlock"><?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (2)" width="90" x="45" y="34">
<parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Web"/>
<parameter key="recursive" value="true"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34">
<parameter key="extract_text_only" value="false"/>
<parameter key="content_type" value="html"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<connect from_port="file object" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">HTML files</description>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.0.003" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="text:html_to_xml" compatibility="8.1.000" expanded="true" height="68" name="HTML to XML" width="90" x="45" y="34"/>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="body" value="<body.*[\s\S]+</body>"/>
</list>
<list key="regular_region_queries">
<parameter key="body" value="<body\.*>.<\\/body>"/>
</list>
<list key="xpath_queries">
<parameter key="inhalt_html-dokumente" value="//h:div[@id="content_center"]//h:div[@class="conttext"][text()]"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="112" y="34">
<parameter key="minimum_text_block_length" value="6"/>
</operator>
<operator activated="true" class="text:filter_documents_by_content" compatibility="8.1.000" expanded="true" height="82" name="Filter Documents (by Content)" width="90" x="246" y="34">
<parameter key="condition" value="contains match"/>
<parameter key="regular_expression" value="."/>
</operator>
<connect from_port="segment" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Filter Documents (by Content)" to_port="documents 1"/>
<connect from_op="Filter Documents (by Content)" from_port="documents" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="single" to_op="HTML to XML" to_port="document"/>
<connect from_op="HTML to XML" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Keep only relevant text</description>
</operator>
<operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (3)" width="90" x="45" y="187">
<parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Projekt"/>
<parameter key="recursive" value="true"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document (2)" width="90" x="112" y="34">
<parameter key="encoding" value="UTF-8"/>
</operator>
<connect from_port="file object" to_op="Read Document (2)" to_port="file"/>
<connect from_op="Read Document (2)" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">TXT files</description>
</operator>
<operator activated="true" class="collect" compatibility="9.0.003" expanded="true" height="103" name="Collect (2)" width="90" x="313" y="136">
<description align="center" color="transparent" colored="false" width="126">Collect source documents</description>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents (2)" width="90" x="447" y="136">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="99999"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34">
<parameter key="mode" value="regular expression"/>
<parameter key="characters" value=" "/>
<parameter key="expression" value="((-[^a-zA-Z])+)|(([^a-zA-Z]{1,}-)+)|([^a-zA-Zäöü0-9-]+)"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="313" y="34">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="100"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="179" y="136">
<parameter key="file" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\stopwords-de-solariz-small.txt"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="313" y="136"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="136"/>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="313" y="238">
<parameter key="condition" value="contains match"/>
<parameter key="string" value="^[0-9]"/>
<parameter key="regular_expression" value="^[^0-9].*"/>
</operator>
<operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="238">
<parameter key="max_length" value="3"/>
</operator>
<operator activated="false" class="text:filter_tokens_by_pos" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="514" y="340">
<parameter key="language" value="German"/>
<parameter key="expression" value="NE"/>
<parameter key="invert_filter" value="true"/>
</operator>
<operator activated="false" class="text:stem_german" compatibility="8.1.000" expanded="true" height="68" name="Stem (German)" width="90" x="447" y="493"/>
<operator activated="true" class="open_file" compatibility="9.0.003" expanded="true" height="68" name="Open File" width="90" x="112" y="544">
<parameter key="filename" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\rogerwordlist3.txt"/>
</operator>
<operator activated="true" class="text:stem_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Stem (Dictionary)" width="90" x="246" y="442"/>
<operator activated="true" class="text:extract_token_number" compatibility="8.1.000" expanded="true" height="68" name="Extract Token Number" width="90" x="648" y="34"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Stem (Dictionary)" to_port="document"/>
<connect from_op="Open File" from_port="file" to_op="Stem (Dictionary)" to_port="file"/>
<connect from_op="Stem (Dictionary)" from_port="document" to_op="Extract Token Number" to_port="document"/>
<connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Process documents</description>
</operator>
<operator activated="true" class="write_excel" compatibility="9.0.003" expanded="true" height="82" name="Write Excel (2)" width="90" x="514" y="34">
<parameter key="excel_file" value="D:\Dropbox\_BT\Textanalyse\terms-multimediaprod.xlsx"/>
<parameter key="number_format" value="#.000"/>
</operator>
<operator activated="false" class="text:process_documents" compatibility="8.1.000" expanded="true" height="82" name="Process Documents" width="90" x="246" y="595">
<process expanded="true">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Loop Files (2)" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Collect (2)" to_port="input 1"/>
<connect from_op="Loop Files (3)" from_port="output 1" to_op="Collect (2)" to_port="input 2"/>
<connect from_op="Collect (2)" from_port="collection" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_op="Write Excel (2)" to_port="input"/>
<connect from_op="Process Documents (2)" from_port="word list" to_port="result 2"/>
<connect from_op="Write Excel (2)" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process></pre></div>
Comments
But in the meantime, if you want a workaround, you can try the Stem Tokens Using Exampleset operator, which allows you to put your desired stemming into a normal dataset. This operator is part of the free Operator Toolbox extension.
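(For illustration, a dataset for such an operator could be derived from the existing dictionary file by expanding each line into one row per inflected form. This is a hedged sketch; the column names "word" and "stem" are assumptions, so check the operator's help for the names it actually expects.)

```python
import csv

def dictionary_to_exampleset_csv(dict_path, csv_path):
    """Convert 'stem:form1 form2 ...' lines into a two-column CSV
    (one row per inflected form) that can be read into an example set."""
    with open(dict_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["word", "stem"])  # assumed column names
        for line in src:
            line = line.strip()
            if not line or ":" not in line:
                continue  # skip blank or malformed lines
            stem, forms = line.split(":", 1)
            for form in forms.split():
                writer.writerow([form, stem])
```

The resulting CSV can then be loaded with Read CSV and wired into the operator's example set input.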
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Scott
@Telcontar120
I don't know which operator you mean (I can't find an operator named "Stem Tokens").
@sgenzer
The stemming file is attached.
The new, simplified process is as follows:
To elaborate on @Telcontar120's suggestion, you have to:
- Go to the Marketplace and install the Operator Toolbox extension.
- Then follow the instructions in this screenshot:
I hope it helps,
Regards,
Lionel
Thank you, now I have the "Operator Toolbox way" working.
As far as I can see, it can be used to create custom stemming rules, but it doesn't look as if it can be used for dictionary-based stemming, right?
@sgenzer have you had time to look at the issue yet?
Thanks again to everyone involved for your time!
Regards, Roger
Scott
I don't think Operator Toolbox is the way to go, as I can't find a way to do dictionary-based stemming with it (only rule-based stemming).
So I am looking forward to a solution with the "Stem (Dictionary)" operator :-)
Roger