StopwordFilterFile
nguyenxuanhau
I'm using the StopwordFilterFile operator, but it does not filter out many of my stop words, such as: với, ới, tời, đỗ
My process XML is as follows:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.6">
  <operator name="Root" class="Process" expanded="yes">
    <description text="Text Hau"/>
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <operator name="TextInput" class="TextInput" expanded="yes">
      <list key="texts">
        <parameter key="graphics" value="../../data/dulieu"/>
      </list>
      <parameter key="default_content_type" value=""/>
      <parameter key="default_content_encoding" value="UTF-8"/>
      <parameter key="default_content_language" value=""/>
      <parameter key="prune_below" value="1"/>
      <parameter key="prune_above" value="-1"/>
      <parameter key="vector_creation" value="TFIDF"/>
      <parameter key="use_content_attributes" value="false"/>
      <parameter key="use_given_word_list" value="false"/>
      <parameter key="return_word_list" value="false"/>
      <parameter key="id_attribute_type" value="short"/>
      <list key="namespaces">
      </list>
      <parameter key="create_text_visualizer" value="false"/>
      <parameter key="on_the_fly_pruning" value="-1"/>
      <parameter key="extend_exampleset" value="false"/>
      <operator name="StringTokenizer" class="StringTokenizer">
      </operator>
      <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
      </operator>
      <operator name="StopwordFilterFile" class="StopwordFilterFile">
        <parameter key="file" value="../../data/dulieu/stopword/stopword.dat"/>
        <parameter key="case_sensitive" value="false"/>
      </operator>
    </operator>
  </operator>
</process>
The stopword file contains one stop word per line.
What do I need to do to get the StopwordFilterFile operator to work?
Greetings!
Answers
Thanks for posting the process; however, most folks now use version 5 and will not be able to load it. Upgrade to commune!
As to your problem, my guess is that it comes down to the characters in those words and whether their encoding is set consistently, both in RapidMiner and in the stopword file (I notice your RapidMiner XML uses both windows-1252 and UTF-8). There are also problems specific to Vietnamese, detailed here: http://vietunicode.sourceforge.net/main.html. Obviously, if the same letters are represented differently in the texts and in the stopword file, the words will not match; but if they are represented in the same format throughout, then I'd need to look into the source.
Which I don't have, because the Text plugin has also been updated!
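To make the encoding guess above a bit more concrete, here is a minimal Python sketch (not RapidMiner code, and not a statement about how StopwordFilterFile works internally). It only shows two ways in which a word like "với" can look identical on screen yet fail an exact string comparison: precomposed versus decomposed Unicode, and UTF-8 bytes read with the windows-1252 charset. The helper names `normalize_word` and `is_stopword` are made up for the illustration.

```python
import unicodedata

# "với" written with an escape so the source file's own encoding does not matter.
WORD = "v\u1edbi"


def normalize_word(word: str) -> str:
    """Lowercase a word and bring it into precomposed (NFC) form."""
    return unicodedata.normalize("NFC", word).lower()


def is_stopword(token: str, stopwords: set) -> bool:
    """Hypothetical check: compare after normalizing both sides the same way."""
    normalized_stopwords = {normalize_word(w) for w in stopwords}
    return normalize_word(token) in normalized_stopwords


# 1) Same visible word, two different code point sequences.
precomposed = unicodedata.normalize("NFC", WORD)  # 3 code points
decomposed = unicodedata.normalize("NFD", WORD)   # base letters + combining marks

print(precomposed == decomposed)                  # False: exact match fails
print(is_stopword(decomposed, {precomposed}))     # True: match after normalizing

# 2) UTF-8 bytes decoded with the wrong charset no longer equal the word.
utf8_bytes = WORD.encode("utf-8")
misread = utf8_bytes.decode("windows-1252", errors="replace")

print(misread == WORD)                            # False: the text was garbled
print(utf8_bytes.decode("utf-8") == WORD)         # True: same charset both ways
```

If something like this is the cause, saving the stopword file and the input texts in the same encoding and the same Unicode form would be the first thing to try; whether that is enough for this particular operator is an assumption I cannot verify without the source.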