"Dictionary Spanish (text mining)"

ronel74 Member Posts: 2 Contributor I
edited June 2019 in Help
Hi, I recently started using RapidMiner and I am having trouble with some text processing operators, because the language I am working with is Spanish.

The operators that I would like to use are:

Stemming
tokenize linguistic
filter stopwords

Are these operators available for Spanish texts?

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Snowball stemming supports Spanish.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ClaraCaba Member Posts: 9 Contributor II
    Still no Filter Stopwords available in Spanish though, right? :(

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Actually, there are Spanish stopword lists you can download from the internet and add to your process using the Filter Stopwords (Dictionary) operator.
    Just follow the operator documentation: create a file with one Spanish word per line and use it as the dictionary.

    Here's a short example using the stopwords listed here: http://www.ranks.nl/stopwords/spanish
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Root">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="1969"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.0.000" expanded="true" height="68" name="Spanish Stopwords" width="90" x="45" y="187">
            <parameter key="text" value="un&#10;una&#10;unas&#10;unos&#10;uno&#10;sobre&#10;todo&#10;también&#10;tras&#10;otro&#10;algún&#10;alguno&#10;alguna&#10;algunos&#10;algunas&#10;ser&#10;es&#10;soy&#10;eres&#10;somos&#10;sois&#10;estoy&#10;esta&#10;estamos&#10;estais&#10;estan&#10;como&#10;en&#10;para&#10;atras&#10;porque&#10;por qué&#10;estado&#10;estaba&#10;ante&#10;antes&#10;siendo&#10;ambos&#10;pero&#10;por&#10;poder&#10;puede&#10;puedo&#10;podemos&#10;podeis&#10;pueden&#10;fui&#10;fue&#10;fuimos&#10;fueron&#10;hacer&#10;hago&#10;hace&#10;hacemos&#10;haceis&#10;hacen&#10;cada&#10;fin&#10;incluso&#10;primero&#10;desde&#10;conseguir&#10;consigo&#10;consigue&#10;consigues&#10;conseguimos&#10;consiguen&#10;ir&#10;voy&#10;va&#10;vamos&#10;vais&#10;van&#10;vaya&#10;gueno&#10;ha&#10;tener&#10;tengo&#10;tiene&#10;tenemos&#10;teneis&#10;tienen&#10;el&#10;la&#10;lo&#10;las&#10;los&#10;su&#10;aqui&#10;mio&#10;tuyo&#10;ellos&#10;ellas&#10;nos&#10;nosotros&#10;vosotros&#10;vosotras&#10;si&#10;dentro&#10;solo&#10;solamente&#10;saber&#10;sabes&#10;sabe&#10;sabemos&#10;sabeis&#10;saben&#10;ultimo&#10;largo&#10;bastante&#10;haces&#10;muchos&#10;aquellos&#10;aquellas&#10;sus&#10;entonces&#10;tiempo&#10;verdad&#10;verdadero&#10;verdadera&#10;cierto&#10;ciertos&#10;cierta&#10;ciertas&#10;intentar&#10;intento&#10;intenta&#10;intentas&#10;intentamos&#10;intentais&#10;intentan&#10;dos&#10;bajo&#10;arriba&#10;encima&#10;usar&#10;uso&#10;usas&#10;usa&#10;usamos&#10;usais&#10;usan&#10;emplear&#10;empleo&#10;empleas&#10;emplean&#10;ampleamos&#10;empleais&#10;valor&#10;muy&#10;era&#10;eras&#10;eramos&#10;eran&#10;modo&#10;bien&#10;cual&#10;cuando&#10;donde&#10;mientras&#10;quien&#10;con&#10;entre&#10;sin&#10;trabajo&#10;trabajar&#10;trabajas&#10;trabaja&#10;trabajamos&#10;trabajais&#10;trabajan&#10;podria&#10;podrias&#10;podriamos&#10;podrian&#10;podriais&#10;yo&#10;aquel"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:write_document" compatibility="7.0.000" expanded="true" height="82" name="Create a file of these words" width="90" x="179" y="187">
            <parameter key="overwrite" value="true"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
            <parameter key="file" value="myFile.txt"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
            <parameter key="case_sensitive" value="false"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_op="Spanish Stopwords" from_port="output" to_op="Create a file of these words" to_port="document"/>
          <connect from_op="Create a file of these words" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
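
For anyone who wants to prepare or check the dictionary file outside RapidMiner, here is a minimal Python sketch of the same idea: write one stopword per line in UTF-8, tokenize on non-letter characters, and drop tokens found in the list without regard to case (mirroring `case_sensitive` = `false` above). It is only an illustration of the logic, not RapidMiner's implementation. The function names, the file name `stopwords_es.txt`, and the shortened word list are assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a handful of entries taken from the list in the process above.
# A real dictionary file would contain the full list.
SAMPLE_STOPWORDS = ["un", "una", "el", "la", "los", "en", "para", "por", "es", "muy"]


def write_stopword_file(path, words):
    """Write one stopword per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def load_stopwords(path):
    """Read the file back into a lowercase set, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def tokenize(text):
    """Return runs of letters; every non-letter character acts as a separator."""
    # [^\W\d_] matches Unicode letters only, so accented letters stay in the token.
    return re.findall(r"[^\W\d_]+", text)


def filter_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file name, used only for this demonstration.
        stopword_path = Path(tmp_dir) / "stopwords_es.txt"
        write_stopword_file(stopword_path, SAMPLE_STOPWORDS)
        stopwords = load_stopwords(stopword_path)

    tokens = tokenize("El gato está en la casa.")
    print(filter_stopwords(tokens, stopwords))
```

In this sketch, a multi-word entry such as "por qué" can never match, because each token is a single word. Entries like that are probably best split or removed when you build your own list.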
  • ClaraCaba Member Posts: 9 Contributor II
    Thank you very much, I did that and it worked perfectly.