Filter Tokens by Content (more than one expression)

ArmMiner · September 2012

Hi all

I want to filter the tokens of my example set with more than one expression.
For example:
keep those examples which contain the words "fast", "delivery" or "again".

Is this possible? If yes, with which operator?
Thanks!

Skirzynski · September 2012

The operator you are looking for is "Filter Examples" with the condition class "attribute_value_filter". In the parameter string you can use regular expressions. Here is a process with just this operator; it assumes that the attribute containing the text to filter is named "text".



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="568" width="587">
      <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples" width="90" x="45" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="text = .*again.*|.*delivery.*|.*fast.*"/>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

ArmMiner · September 2012

Thanks for the reply!
Does the name "text" have to be the name of my Excel file?

Skirzynski · September 2012

The name of the attribute for the text content.

ArmMiner · September 2012

Actually, I'm getting an error. Instead of "text" I wrote "Bewertung", which is the column name of the reviews in my data.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="341" width="756">
      <operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="75">
        <parameter key="connection" value="sqlserver"/>
        <parameter key="query" value="SELECT `Bewertung`&#10;FROM `training_schnell`"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
        <parameter key="prunde_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="100.0"/>
        <list key="specify_weights"/>
        <process expanded="true" height="345" width="774">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=".:,:;:!:?:|:"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="120">
            <parameter key="max_chars" value="9999"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="45" y="210"/>
          <operator activated="true" class="text:stem_german" compatibility="5.2.004" expanded="true" height="60" name="Stem (German)" width="90" x="179" y="30"/>
          <operator activated="false" class="text:filter_tokens_by_content" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="447" y="165">
            <parameter key="string" value="schnell "/>
            <parameter key="regular_expression" value="(schnell)"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (German)" to_port="document"/>
          <connect from_op="Stem (German)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="false" class="text:wordlist_to_data" compatibility="5.2.004" expanded="true" height="76" name="WordList to Data" width="90" x="313" y="210"/>
      <operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples" width="90" x="514" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="Bewertung = .*wieder.*|.*lieferung.*|.*schnell.*"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="514" y="165">
        <parameter key="excel_file" value="C:\Users\MP-TEST\Desktop\Rapid_Test\Klein.xls"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

MariusHelf · September 2012

After the Process Documents operator there is no text attribute anymore! If you had set a breakpoint after that operator, you would have seen that. If you want to filter on a token basis (which is not exactly what you described in the first post), you have to use Filter Tokens (by Content) inside Process Documents.

I have been following this post and your other post for quite some time, and I get the feeling that it may be a good idea to take a step back, leave the rather complicated text processing aside and get used to the common concepts of RapidMiner and data mining with RapidMiner with the help of our tutorials. That will make it much easier for you to assemble your processes, and to debug them if anything does not work. There is also a good book available which is even downloadable for free: search for "Data Mining for the Masses" by Matt North. There the author explains many concepts of data mining with simple but realistic examples, starting at a very basic level and advancing to more and more complex topics. Most of the chapters use RapidMiner as the platform for doing the exercises.

If you have any further questions you are of course invited to ask for help here on the forums!

All the best,
Marius

ArmMiner · September 2012

Your help is really appreciated and I will get that book. Thanks a lot to this forum, because it's really very nice when experienced users are ready to help.
Actually, I have tried Filter Tokens (by Content), but I couldn't figure out how to specify more than one expression in the corresponding field. I searched for sample syntax, but found nothing. :-\

MariusHelf · October 2012

You can "combine" several regular expressions with the vertical bar, e.g.

.*dog.*|.*cat.*|.*fish.*

would match doghouse, catfish, fish food, and anything else containing one of the words.
Regular expressions are quite complex, and there are complete books on this topic alone. The basic syntax, however, can be learned quickly from tutorials on the internet.
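
As a rough illustration of the vertical-bar syntax (not part of the original post), here is a small Python sketch. RapidMiner uses Java regular expressions, so Python's `re` module is only a stand-in here; the assumption is that it behaves the same for these simple constructs. The token list is made up for the example.

```python
import re

# Pattern from the post above: three alternatives joined with "|".
pattern = re.compile(r".*dog.*|.*cat.*|.*fish.*")

# Hypothetical sample tokens, chosen only for illustration.
tokens = ["doghouse", "catfish", "fish food", "bird", "Fish"]

# fullmatch() requires the whole token to match the pattern, which is
# assumed here to mirror a "matches"-style condition.
kept = [t for t in tokens if pattern.fullmatch(t)]
print(kept)
```

"bird" is dropped because it contains none of the three words, and "Fish" is dropped because the pattern is case sensitive.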

RapidMiner contains a regular expression dialog, where you can directly test the expressions you entered. It is available in many parameters where you can enter regular expressions, e.g. Select Attributes (with attribute_filter_type = regular_expression). Since the dialog is quite new, not all fields have been ported, so it's not yet in Filter Tokens. There are also many free regex testers on the internet.

Happy Mining!
~Marius

ArmMiner · October 2012

Hi Marius

Thanks for the help. I will try and give the feedback!

Best regards
Armen

MariusHelf · October 2012

I forgot to set the "condition" in Filter Tokens to "matches" - only if you do that, you can use regular expressions, and you also get the dialog oO

rajbanokhan · December 2018

hi

I used this example and it works, but can you tell me, in this expression

.*dog.*|.*cat.*|.*fish.*

what is the meaning of the * and the . that are used?

Telcontar120 · December 2018

In regex, the "." is the wildcard character: it matches any single character. The "*" is a special code that indicates 'zero or more of the preceding character', so ".*" is basically the expression for anything. So the expression above looks for anything that contains the string "dog" or "cat" or "fish" anywhere in the token.
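
To make the two symbols concrete, here is a minimal sketch in Python (an assumption for illustration; RapidMiner itself uses Java regular expressions, which treat "." and "*" the same way in these cases). The sample strings are invented.

```python
import re

# "." matches any single character (except a line break by default).
assert re.fullmatch(r"d.g", "dog")
assert re.fullmatch(r"d.g", "dig")
assert not re.fullmatch(r"d.g", "dg")   # "." needs exactly one character

# "*" repeats the preceding element zero or more times.
assert re.fullmatch(r"do*g", "dg")
assert re.fullmatch(r"do*g", "doooog")

# ".*" therefore matches any run of characters, including an empty one.
contains_dog = re.compile(r".*dog.*")
results = {t: bool(contains_dog.fullmatch(t))
           for t in ["dog", "hotdog", "dogs", "cat"]}
print(results)
```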

rajbanokhan · December 2018

Thank you so much, I got it.

Can you suggest a book on regular expressions?

Telcontar120 · December 2018

"Regular Expressions in 10 Minutes" by Ben Forta is a good introduction and it is available on Amazon for low cost.

rajbanokhan · December 2018

hi sir

I have a problem when I use this regular expression with the match condition:

.*dog.*|.*cat.*|.*fish.*

In the result only dog and cat came up. The third one (fish) did not show in the result.

kayman · December 2018

To rule out the obvious: are there cases that contain fish, or do you have cases that contain, for instance, both cat and fish?

The regular expression used is a bit on the greedy side, meaning you can get a lot of results, but not necessarily the right ones, depending on how your text is structured.

In the given example it will only match the exact case, so if you have for instance Fish (with capital F), it will not match. It will also capture fishing, hotdog, category and so on, and while that might be useful for some scenarios, it can also lead to unexpected results.

There are ways to improve this, using some of the more advanced yet cool options of regular expressions.

You could use groups to start with; that already reduces the wildcards and makes it more readable and less error-prone.

The above then becomes

.*(cat|dog|fish).*

It does exactly the same. It reads as 'take whatever you want (the dot), as many times as you like (the asterisk), followed by either cat, dog or fish, and then again followed by whatever, as much as you want'.
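
A quick way to check that the two forms behave alike is to run both over the same tokens. The sketch below uses Python's `re` as a stand-in for the Java regex engine in RapidMiner (an assumption), with invented tokens. It also shows the side effects mentioned earlier: "fishing", "hotdog" and "category" are kept, while "Fish" is not.

```python
import re

long_form = re.compile(r".*dog.*|.*cat.*|.*fish.*")
grouped = re.compile(r".*(cat|dog|fish).*")

# Hypothetical sample tokens.
tokens = ["fish", "fishing", "hotdog", "category", "Fish", "bird"]

long_hits = [t for t in tokens if long_form.fullmatch(t)]
grouped_hits = [t for t in tokens if grouped.fullmatch(t)]

print(long_hits)
print(grouped_hits)
```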

This is what we call a greedy pattern: we don't care what we get or how much of it. This is typically no problem when dealing with short sentences, but can cost you a lot of memory when you have long content.

So one small improvement already:

^.*?(cat|dog|fish).*

OK, two small changes. The first is the 'hat' (^), which means 'begin at the start of the sentence'. The second is the question mark, which makes the preceding .* lazy: it takes as few characters as possible and so ends at the first match. So using ^.*? is short for 'begin at the start, and end as soon as you find the first match'. This can save quite some time with large texts, as the greedy version first runs to the end of the sentence and then works its way back.
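
The difference between the greedy and the lazy form is easiest to see by looking at which word the group captures. This is a sketch in Python (assumed to behave like the Java engine here), with a made-up sentence. Both patterns match the same texts; only the path to the match differs.

```python
import re

text = "the cat chased the dog"

# Greedy: ".*" first grabs everything, then backs up until the group fits,
# so the group ends up on the LAST candidate word.
greedy = re.match(r"^.*(cat|dog|fish).*", text)

# Lazy: ".*?" grows one character at a time, so the group lands on the
# FIRST candidate word.
lazy = re.match(r"^.*?(cat|dog|fish).*", text)

print(greedy.group(1))
print(lazy.group(1))
```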

Now, we can still only match lower case, and while it is good practice to set all of your text to either lower or upper case in a text analysis workflow, there are of course occasions where we need the difference. Anyway, to ignore case we use the i flag, placed at the start of the pattern so that it applies to everything that follows:

(?i)^.*?(cat|dog|fish).*

So now it will find cat, Cat, CAT, and any other casing, should that be a requirement.
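
A small sketch of the ignore-case flag, written in Python as a stand-in for the Java engine (an assumption). The flag is placed at the start of the pattern, which recent Python versions require. The tokens are invented.

```python
import re

# (?i) at the very start switches on case-insensitive matching
# for the whole pattern.
pattern = re.compile(r"(?i)^.*?(cat|dog|fish).*")

tokens = ["cat", "Cat", "CAT", "Fish", "bird"]
hits = [t for t in tokens if pattern.fullmatch(t)]
print(hits)
```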

You can combine several flags. While the i flag means ignore case, the m flag (multiline) makes ^ match at the start of every line instead of only at the start of the text. Combining them as below means that every sentence, when separated by line breaks, gets the same treatment.

(?im)^.*?(cat|dog|fish).*

The order doesn't matter; (?mi) would work exactly the same.
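
To see what the m flag changes, compare a search with and without it on a two-line text. This is a Python sketch under the assumption that it mirrors the Java behaviour for these flags; the text is made up.

```python
import re

text = "no pets here\nmy dog barks"

# Without m: "^" only matches at the very start of the text, and "."
# does not cross the line break, so the second line is never reached.
single = re.search(r"(?i)^.*?(cat|dog|fish).*", text)

# With m: "^" also matches right after the line break.
multi = re.search(r"(?im)^.*?(cat|dog|fish).*", text)

print(single)
print(multi.group(0))
```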

Now, we still have the problem that we can get things like category or hotdog in the results, so the final part is to use the word boundary, which ensures that we only get a match when it is exactly the same word. A word boundary can be anything like a comma, a dot, a space, the end or beginning of a sentence, etc. Luckily there is a little helper again (\b), so the pattern below will give you an exact match, stop at the first match, and look at every line you have.

(?im)^.*?\b(cat|dog|fish)\b.*
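
A short sketch of the word-boundary effect, in Python as a stand-in for the Java engine (an assumption), using invented sample texts:

```python
import re

pattern = re.compile(r"(?im)^.*?\b(cat|dog|fish)\b.*")

# Hypothetical sample texts.
texts = ["I like my Cat", "category", "hotdog", "fresh fish, please", "dog"]

# \b only matches between a word character and a non-word character
# (or the start/end of the text), so "category" and "hotdog" drop out.
hits = [t for t in texts if pattern.fullmatch(t)]
print(hits)
```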

As an alternative, you could use the s flag when you have a lot of line breaks. It makes the dot match line breaks as well, so your text is treated as one single line.

(?is)^.*?\b(cat|dog|fish)\b.*
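
The effect of the s flag can be sketched the same way, again in Python under the assumption that it matches the Java behaviour, with the flags written at the start of the pattern and a made-up two-line text:

```python
import re

text = "no pets here\nmy dog barks"

# Without s: "." stops at the line break, so the whole text cannot match.
plain = re.fullmatch(r"(?i)^.*?\b(cat|dog|fish)\b.*", text)

# With s: "." also matches the line break, so the text is handled
# as one long line.
dotall = re.fullmatch(r"(?is)^.*?\b(cat|dog|fish)\b.*", text)

print(plain)
print(dotall.group(1))
```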

FINAL EDIT: it seems the code block messes up the content a bit; all of the symbols used need to be on one single line.

rajbanokhan · December 2018

Thank you so much for giving me such nice advice. I tried it and it works.

rajbanokhan · December 2018

hi sir

I have a question about the operator name, Filter Tokens (by Content). Can I say that we are searching for words with this operator, or finding something specific? Is it right to describe it with words like "searching" or "finding something specific"?

rajbanokhan · September 2019

hi sir

How do I write a regular expression for matching all tokens? For example, I have two documents, and the word list of document 1 should be matched against document 2, so that the words in document 2 which do not match don't appear in the result.

kayman · September 2019

If you have different wordlists you might try the join operators. Convert your wordlists to data, link your word attributes and inner join will return the ones you have in both, and if you use the Set Minus operator you can filter on the words that appear in one set but not in the other.
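
The idea behind the inner join and Set Minus can be sketched with plain Python sets. This is only an illustration of the logic, not the RapidMiner operators themselves, and the word lists are invented.

```python
# Hypothetical word lists for two documents.
words_doc1 = {"fast", "delivery", "again", "price"}
words_doc2 = {"delivery", "again", "slow", "package"}

# Words present in both lists (what an inner join on the word would return).
in_both = sorted(words_doc1 & words_doc2)

# Words in document 2 that are missing from document 1 (the set-minus idea).
only_in_doc2 = sorted(words_doc2 - words_doc1)

print(in_both)
print(only_in_doc2)
```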

Regex is probably not going to work here if that is what you want to achieve.

rajbanokhan · September 2019

hi sir

Thank you for the suggestion. Actually, I am working with regular expressions, which is why I am asking about them. I am using Filter Tokens (by Content) with the matches condition. If there is a way to do this with a regular expression, good; if not, that's OK.

kayman · September 2019

Would you have some examples you can share? This might make it easier to understand the actual problem

rajbanokhan · September 2019

In Filter Tokens (by Content) I used a regular expression with the matches option. It works for selecting specific words, but the list of specific words is too large (200-300 words), so the regular expression above doesn't fit it. So I am trying to match one list against another. I hope I have conveyed my message.
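
If the regex route is still wanted for a long word list, one option is to build the expression with a small script and paste the result into the operator. The sketch below is only an assumption about how that could be done outside RapidMiner. The names `keep_words` and `pattern_text` are hypothetical, and the three words stand in for the real list.

```python
import re

# Hypothetical word list; in practice this would hold the 200-300 words.
keep_words = ["fast", "delivery", "again"]

# Escape each word so special characters are taken literally,
# then join the words with the vertical bar.
alternation = "|".join(re.escape(w) for w in keep_words)
pattern_text = "(?i)(" + alternation + ")"
print(pattern_text)

# Quick check on a few sample tokens (whole-token match).
tokens = ["Fast", "delivery", "faster", "slow"]
hits = [t for t in tokens if re.fullmatch(pattern_text, t)]
print(hits)
```

Because the whole token has to match, "faster" is not kept here. Wrapping the group in .* on both sides would bring back the looser "contains" behaviour.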

rajbanokhan · December 2019

hi sir
I have a question: what is the box called in RapidMiner where we write the regular expression?

kayman · January 2020

All replace operators support regex.
