Words/String Matching Producing true or false

asn4293 · April 2018

I have a data set, for example:

Internal Experience Functional Area

Marketing & Sales

Controlling/Accounting

Marketing & Sales|Marketing & Sales

General Management

Marketing & Sales

Logistics|Logistics|Logistics|Logistics

Logistics

Marketing & Sales

I want to match it against my requirements xlsx file, which contains this column:

Match words

sales

This is string matching and it should not be case sensitive, meaning it should work whether the letters are lowercase or uppercase. After matching, it should give me a result of true or false (or 1 or 0). The result should look like this:

Internal Experience Functional Area	Matching result
Marketing & Sales	TRUE
Marketing & Sales	TRUE
Controlling/Accounting	FALSE
Marketing & Sales|Marketing & Sales	TRUE
General Management	FALSE
Marketing & Sales	TRUE
Logistics|Logistics|Logistics|Logistics	FALSE
Logistics	FALSE
Marketing & Sales	TRUE

I don't know how this can be done. Please help.

kypexin · April 2018

Hi @asn4293

Let's assume that 'Area' is a short name for the attribute containing strings.

Use the 'Generate Attributes' operator to create a new attribute named 'MatchingResult', with the following parameters:

attribute name: MatchingResult

function expressions: contains(lower([Area]), "sales")

This generates 'true' if the lowercased 'Area' value contains the substring 'sales', and 'false' otherwise.
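
For readers who want to sanity-check the logic outside RapidMiner, here is a minimal Python sketch of the same idea: lowercase the text, then test for the substring. The sample values come from the question; the helper name `contains_keyword` and the list names are made up for illustration.

```python
# Sample values taken from the question; all names here are illustrative.
areas = [
    "Marketing & Sales",
    "Controlling/Accounting",
    "Marketing & Sales|Marketing & Sales",
    "General Management",
    "Logistics|Logistics|Logistics|Logistics",
]


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive substring test: lowercase both sides, then check containment."""
    return keyword.lower() in text.lower()


# One boolean per row, in the same order as the input values.
matching_result = [contains_keyword(area, "sales") for area in areas]

for area, matched in zip(areas, matching_result):
    print(f"{area}\t{str(matched).upper()}")
```

This is a substring test, so a value such as "Wholesales" would also count as a match for "sales".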

asn4293 · April 2018

@kypexin

Thank you for your feedback, but this is only practical when there is a single search term and the expression can be written by hand each time. I have approximately 1000 terms to match against a huge data set, so this approach would not be suitable.

I want to specify a column containing the words, so that the two columns are matched against each other.

kypexin · April 2018

Hi @asn4293

So the task becomes much more general: you have to fuzzy match two columns of text attributes, which technically amounts to many-to-many matching. This sounds like a slightly tricky task to accomplish with RapidMiner; at least I cannot come up with an easy solution off the top of my head... However, my suggestions are:

Have a look at a very interesting trick from @BalazsBarany's website on how to perform generic joins in RM; maybe this can give you some inspiration: https://datascientist.at/2016/06/generic-joins-in-rapidminer/#english
You could also consider a Python script to accomplish this task, which in the end might be much faster and simpler to implement.

If you could share your actual files you need to match, we could probably try to play around with these to get a faster solution with RM.
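
A rough sketch of the Python route mentioned above, assuming both columns have already been loaded into plain lists (reading the data file and the xlsx file is left out). It compiles all match words into one case-insensitive pattern, so a long keyword list needs only one search call per row. The function names and the second keyword `accounting` are hypothetical and not taken from the original files.

```python
import re


def build_matcher(keywords):
    """Compile one case-insensitive pattern that matches any of the keywords."""
    cleaned = [k.strip() for k in keywords if k and k.strip()]
    if not cleaned:
        raise ValueError("no keywords given")
    # re.escape keeps characters such as '&' or '/' from being read as regex syntax.
    return re.compile("|".join(re.escape(k) for k in cleaned), re.IGNORECASE)


def match_column(values, keywords):
    """Return one boolean per value: True if any keyword occurs as a substring."""
    matcher = build_matcher(keywords)
    return [bool(matcher.search(value)) for value in values]


# Sample rows from the question; "accounting" is an invented second keyword.
areas = [
    "Marketing & Sales",
    "Controlling/Accounting",
    "General Management",
    "Logistics|Logistics|Logistics|Logistics",
]
match_words = ["sales", "accounting"]

results = match_column(areas, match_words)
for area, matched in zip(areas, results):
    print(f"{area}\t{str(matched).upper()}")
```

Because the pattern is built once and reused, adding more match words changes only the keyword list, not the per-row code.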

SGolbert · April 2018

Hi @asn4293,

I may have found a solution by playing around with Process Documents from Data (from the Text Processing Extension):

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="r_scripting:execute_r" compatibility="8.1.000" expanded="true" height="103" name="Execute R" width="90" x="112" y="34">
        <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;rm_main = function()&#10;{&#10;    &#10;    data2 &lt;- data.table(Area = c(&quot;MArketing &amp; SaleS&quot;, &quot;Controlling/Accounting&quot;,&#10;    &#9;&#9;&#9;&#9;&#9;&#9;&quot;Logistics&quot;))&#10;&#10;    words = data.table(Match = c(&quot;sales&quot;, &quot;logistics&quot;))&#10;    &#10;    # connect 2 output ports to see the results&#10;    return(list(data2, words))&#10;}&#10;"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="238">
        <list key="function_descriptions">
          <parameter key="OriginalText" value="Area"/>
        </list>
      </operator>
      <operator activated="true" class="remember" compatibility="8.1.003" expanded="true" height="68" name="Remember" width="90" x="380" y="34">
        <parameter key="name" value="match_words"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="238">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="Area" value="1.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34"/>
          <operator activated="true" class="recall" compatibility="8.1.003" expanded="true" height="68" name="Recall" width="90" x="380" y="187">
            <parameter key="name" value="match_words"/>
            <parameter key="remove_from_store" value="false"/>
          </operator>
          <operator activated="true" class="operator_toolbox:filter_tokens_using_exampleset" compatibility="1.0.000" expanded="true" height="82" name="Filter Tokens Using ExampleSet" width="90" x="648" y="34">
            <parameter key="attribute" value="Match"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens Using ExampleSet" to_port="document"/>
          <connect from_op="Recall" from_port="result" to_op="Filter Tokens Using ExampleSet" to_port="example set"/>
          <connect from_op="Filter Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="715" y="238">
        <list key="function_descriptions">
          <parameter key="Match" value="if(text == &quot;&quot;, &quot;False&quot;, &quot;True&quot;)"/>
        </list>
      </operator>
      <connect from_op="Execute R" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Execute R" from_port="output 2" to_op="Remember" to_port="store"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Remember" from_port="stored" to_port="result 2"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Note that I generated a couple of test example sets with R, but that's only for my convenience (R is not at all necessary). The idea is to tokenize the string, keep only the tokens that match the keywords, and then check whether the resulting string is empty.
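
The tokenize, filter, check-for-empty idea can be sketched in a few lines of Python. This is only an illustration of the logic, not a translation of the process: it assumes tokens are split on non-letter characters (ASCII letters only here), and the function names are invented. The sample values are the ones from the test data in the process above.

```python
import re


def tokenize(text):
    """Split on runs of non-letter characters and lowercase each token."""
    return [token.lower() for token in re.split(r"[^A-Za-z]+", text) if token]


def matches_any_keyword(text, keywords):
    """True if at least one whole token equals one of the keywords."""
    wanted = {keyword.lower() for keyword in keywords}
    kept = [token for token in tokenize(text) if token in wanted]
    # An empty list of kept tokens corresponds to the empty text in the process.
    return len(kept) > 0


areas = ["MArketing & SaleS", "Controlling/Accounting", "Logistics"]
match_words = ["sales", "logistics"]

token_results = {area: matches_any_keyword(area, match_words) for area in areas}
for area, matched in token_results.items():
    print(f"{area}\t{matched}")
```

Unlike a substring test, this compares whole tokens, so "Wholesales" would not match the keyword "sales".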

I leave it up to you to refactor this "quick and dirty" solution XD

Kind regards,

Sebastian

kypexin · April 2018

@SGolbert pretty neat!

asn4293 · May 2018

I don't know how to put the R code in. Can you please help me fix it, @SGolbert? One file has the data in it; the second file is the one it gets its data from.

Data file
This file is the drop down which is data to look into
