
"filter by upper case letter?"

erocoar Member Posts: 6 Contributor II
edited June 2019 in Help
Hey there,

I just recently installed RapidMiner for a university project. I have only worked with R so far, so this is quite new and challenging for me.
I want to extract text from newspaper frontpages as part of analyzing agenda setting in German politics.

My question is whether it is possible to filter by upper-case letter. German nouns start with an upper-case letter, and I would like to filter on that. Unfortunately, I have no idea how to do that. Any help is appreciated :)

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    It's a bit early for me today, but you should be able to do it with Filter Tokens & a regular expression. 

    Don't be scared of regular expressions; this one is especially straightforward.
    - ^ means match from the start of the text; since you are filtering within tokens, it anchors the match to the start of each token.
    - [A-Z] means any upper-case letter from A to Z.
    - . (dot) means any single character.
    - * (asterisk) means any number of the preceding element (in this case .).
    Put together, ^[A-Z].* matches any token that starts with an upper-case letter.

    Have a play with the example below: simply copy & paste the XML into the XML view of RapidMiner and press the green tick to load it.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.4.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="210">
            <parameter key="text" value="this is Some text with Capital Letters and mixed with nonCapital letters. "/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="120">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120">
                <parameter key="mode" value="linguistic tokens"/>
                <parameter key="language" value="German"/>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="6.4.001" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="246" y="120">
                <parameter key="condition" value="matches"/>
                <parameter key="regular_expression" value="^[A-Z].*"/>
                <parameter key="case_sensitive" value="true"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
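    If you want to try the regular expression outside RapidMiner first, here is a minimal Java sketch of the same filter logic. The class name and token list are made up for illustration, and it assumes the matches condition checks the whole token, the way Java's String.matches does:

    import java.util.List;
    import java.util.stream.Collectors;

    public class UppercaseTokenFilter {
        public static void main(String[] args) {
            // Tokens roughly as the Tokenize operator would produce them
            // from the Create Document text above (illustrative only).
            List<String> tokens = List.of("this", "is", "Some", "text", "with",
                    "Capital", "Letters", "and", "mixed", "with",
                    "nonCapital", "letters");

            // Keep only tokens that match ^[A-Z].* as a whole:
            // an upper-case first letter, followed by anything.
            List<String> capitalized = tokens.stream()
                    .filter(t -> t.matches("^[A-Z].*"))
                    .collect(Collectors.toList());

            System.out.println(capitalized); // [Some, Capital, Letters]
        }
    }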
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi erocoar,

    If you are interested in German nouns, you can use the Filter Tokens (by POS Tags) operator as well. There you can specifically search for nouns, adjectives, etc.; German and English are supported. The process below uses it to get the nouns out of the document. Of course, you can also use this inside Process Documents. Further details on the tag set are available at: http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html

    ~Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.5.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
            <parameter key="text" value="Dies ist ein Test."/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="6.5.000" expanded="true" height="60" name="Tokenize" width="90" x="246" y="165"/>
          <operator activated="true" class="text:filter_tokens_by_pos" compatibility="6.5.000" expanded="true" height="60" name="Filter Tokens (by POS Tags)" width="90" x="447" y="165">
            <parameter key="language" value="German"/>
            <parameter key="expression" value="NN"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
          <connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
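    To make the POS idea concrete independently of RapidMiner, here is a small Java sketch of tag-based filtering. The STTS tags below are hand-assigned for illustration, not output of the operator:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class NounFilterSketch {
        public static void main(String[] args) {
            // Hand-assigned STTS tags for "Dies ist ein Test."
            // (a real process would get these from a POS tagger).
            Map<String, String> tagged = new LinkedHashMap<>();
            tagged.put("Dies", "PDS");   // demonstrative pronoun
            tagged.put("ist", "VAFIN");  // finite auxiliary verb
            tagged.put("ein", "ART");    // article
            tagged.put("Test", "NN");    // common noun

            // Keep only tokens whose tag matches the expression "NN",
            // mirroring the operator configured with expression = NN.
            tagged.forEach((token, tag) -> {
                if (tag.matches("NN")) {
                    System.out.println(token); // prints: Test
                }
            });
        }
    }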
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • erocoar Member Posts: 6 Contributor II
    Oh, amazing! Thank you so much :) This really helps a lot. JEdward, how did you manage to switch Filter Tokens from string matching to a regular expression?