"Dictionary Spanish (text mining)"

ronel74 Member Posts: 2 Contributor I
edited June 2019 in Help
Hi, I recently started using RapidMiner and I am having trouble with some text processing operators, because the language I am working with is Spanish.

The operators that I would like to use are:

Stemming
tokenize linguistic
filter stopwords

Are these operators available for Spanish texts?

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Snowball stemming supports Spanish.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ClaraCaba Member Posts: 9 Contributor II
    Still no Filter Stopwords available in Spanish though, right? :(

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Actually, there are Spanish stopword lists you can download from the internet and add to your process using the Filter Stopwords (Dictionary) operator.
    Just follow the operator documentation: create a file with one Spanish word per line and point the operator at that file.

    Here's a short example using the stopwords listed here: http://www.ranks.nl/stopwords/spanish
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Root">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="1969"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.0.000" expanded="true" height="68" name="Spanish Stopwords" width="90" x="45" y="187">
            <parameter key="text" value="un&#10;una&#10;unas&#10;unos&#10;uno&#10;sobre&#10;todo&#10;también&#10;tras&#10;otro&#10;algún&#10;alguno&#10;alguna&#10;algunos&#10;algunas&#10;ser&#10;es&#10;soy&#10;eres&#10;somos&#10;sois&#10;estoy&#10;esta&#10;estamos&#10;estais&#10;estan&#10;como&#10;en&#10;para&#10;atras&#10;porque&#10;por qué&#10;estado&#10;estaba&#10;ante&#10;antes&#10;siendo&#10;ambos&#10;pero&#10;por&#10;poder&#10;puede&#10;puedo&#10;podemos&#10;podeis&#10;pueden&#10;fui&#10;fue&#10;fuimos&#10;fueron&#10;hacer&#10;hago&#10;hace&#10;hacemos&#10;haceis&#10;hacen&#10;cada&#10;fin&#10;incluso&#10;primero&#10;desde&#10;conseguir&#10;consigo&#10;consigue&#10;consigues&#10;conseguimos&#10;consiguen&#10;ir&#10;voy&#10;va&#10;vamos&#10;vais&#10;van&#10;vaya&#10;bueno&#10;ha&#10;tener&#10;tengo&#10;tiene&#10;tenemos&#10;teneis&#10;tienen&#10;el&#10;la&#10;lo&#10;las&#10;los&#10;su&#10;aqui&#10;mio&#10;tuyo&#10;ellos&#10;ellas&#10;nos&#10;nosotros&#10;vosotros&#10;vosotras&#10;si&#10;dentro&#10;solo&#10;solamente&#10;saber&#10;sabes&#10;sabe&#10;sabemos&#10;sabeis&#10;saben&#10;ultimo&#10;largo&#10;bastante&#10;haces&#10;muchos&#10;aquellos&#10;aquellas&#10;sus&#10;entonces&#10;tiempo&#10;verdad&#10;verdadero&#10;verdadera&#10;cierto&#10;ciertos&#10;cierta&#10;ciertas&#10;intentar&#10;intento&#10;intenta&#10;intentas&#10;intentamos&#10;intentais&#10;intentan&#10;dos&#10;bajo&#10;arriba&#10;encima&#10;usar&#10;uso&#10;usas&#10;usa&#10;usamos&#10;usais&#10;usan&#10;emplear&#10;empleo&#10;empleas&#10;emplean&#10;empleamos&#10;empleais&#10;valor&#10;muy&#10;era&#10;eras&#10;eramos&#10;eran&#10;modo&#10;bien&#10;cual&#10;cuando&#10;donde&#10;mientras&#10;quien&#10;con&#10;entre&#10;sin&#10;trabajo&#10;trabajar&#10;trabajas&#10;trabaja&#10;trabajamos&#10;trabajais&#10;trabajan&#10;podria&#10;podrias&#10;podriamos&#10;podrian&#10;podriais&#10;yo&#10;aquel"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:write_document" compatibility="7.0.000" expanded="true" height="82" name="Create a file of these words" width="90" x="179" y="187">
            <parameter key="overwrite" value="true"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
            <parameter key="file" value="myFile.txt"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
            <parameter key="case_sensitive" value="false"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_op="Spanish Stopwords" from_port="output" to_op="Create a file of these words" to_port="document"/>
          <connect from_op="Create a file of these words" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
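
    For anyone who wants to see what the process above does to the text, here is a rough Python sketch of the same idea. It is not RapidMiner's implementation. The function names and the sample sentence are made up for illustration, and the tokenizer only approximates the "non letters" mode by keeping runs of letters. The handful of stopwords is copied from the list above, and matching ignores case, as with case_sensitive set to false.

```python
import re

# A few entries copied from the stopword list in the process above.
# In practice the full list would be read from the dictionary file.
SAMPLE_STOPWORDS = "un\nuna\nsobre\ntambién\npor\npor qué\nlos\nen\nla"


def load_stopwords(text):
    """Parse a one-entry-per-line dictionary into a lowercase set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def tokenize(text):
    """Keep runs of letters; everything else acts as a separator.

    This approximates splitting on non-letter characters and is an
    assumption about the behaviour, not RapidMiner's actual code.
    """
    return re.findall(r"[^\W\d_]+", text)


def filter_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form appears in the dictionary."""
    return [t for t in tokens if t.lower() not in stopwords]


stopwords = load_stopwords(SAMPLE_STOPWORDS)
tokens = tokenize("Los usuarios también trabajan en la oficina, ¿por qué no?")
print(filter_stopwords(tokens, stopwords))
# ['usuarios', 'trabajan', 'oficina', 'qué', 'no']
```

    In this sketch the two-word entry "por qué" never matches, because the tokenizer has already split it into "por" and "qué". Here "por" is removed by its own entry while "qué" survives, so multi-word lines in a one-word-per-line dictionary are worth replacing with single words.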
  • ClaraCaba Member Posts: 9 Contributor II
    Thank you very much, I did that and it worked perfectly.