Similarity between data

online360 · March 2016

Hi everyone!

I have a data set that contains about 250,000 products, with columns such as "artid", "title", "longtext" and so on.

Now I want to find the products that are similar to each product, and the result should look like:
artid; similar1-artid; similar2-artid; and so on.

For this, I'd like to select the columns that should be analyzed, and I'd like to set a "limit of similarity" that tells RapidMiner when to write the artid of a similar product into the results list (next to each product) and when to ignore it.

I had a look at many video tutorials dealing with text classification, but none of them showed how to create such a data set (listing each product again together with the artids of the similar products).
I also tried "data to similarity", but it fails to display the results, even if I filter down to 1 % of the data.

Does anyone have an idea on that?

Many thanks in advance!

RalfKlinkenberg · March 2016

The RapidMiner operator "Cross Distances" should deliver the desired result if you feed your data to both input ports of the operator.

MartinLiebig · March 2016

And to add: please be aware of what happens with 250k entries. You get (250k*250k)/2 pairs. Sounds like a lot!
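
To put a number on that, here is a small Python sketch of the pair count. It assumes unordered pairs without self-comparisons, i.e. n*(n-1)/2, which is marginally below the (250k*250k)/2 estimate; the 8 bytes per value is also an assumption (one double per distance).

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs of distinct items among n items."""
    return n * (n - 1) // 2


n_products = 250_000
pairs = pair_count(n_products)
bytes_per_distance = 8  # assumption: one 64-bit double per pair

print(f"{pairs:,} pairs")
print(f"{pairs * bytes_per_distance / 1e9:.0f} GB just for the distances")
```

That is roughly 31 billion pairs, or on the order of 250 GB if every distance were kept in memory at once.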

online360 · March 2016

Hi!

Many thanks for your input!

Well, I just ran the process (which took about 3 hours on a machine with 16 GB of RAM, even after splitting the data) and received a result containing four columns:
Row Number; request; document; distance

request seems to be the artid which I set to role "id", document contains numbers between 1 and to and distance is always filled with "?".

I thought that I'd receive something like:
id; "similar id"; percentage

Regarding the (250K*250K)/2 rows:
Wouldn't it be possible to get the exact same number of rows as the input data set contains and just add the similar artids in each row (while each of them is a product) as a new column?

Is there anything wrong with my code?:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve t123_product" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Local Repository/data/t123_product"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.0.001" expanded="true" height="82" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.05"/>
        </enumeration>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="langtext|artid|bezeichnung_3"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.0.001" expanded="true" height="82" name="Set Role" width="90" x="514" y="85">
        <parameter key="attribute_name" value="artid"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="cross_distances" compatibility="7.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="715" y="85">
        <parameter key="measure_types" value="NominalMeasures"/>
        <parameter key="only_top_k" value="true"/>
        <parameter key="compute_similarities" value="true"/>
      </operator>
      <connect from_op="Retrieve t123_product" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Set Role" from_port="original" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Thanks!

JEdward · March 2016

If you use Similarity to Data you can select Matrix to get them as a table.

So 250k x 250k.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve t123_product" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Local Repository/data/t123_product"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.0.001" expanded="true" height="82" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.05"/>
        </enumeration>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="langtext|artid|bezeichnung_3"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.0.001" expanded="true" height="82" name="Set Role" width="90" x="514" y="85">
        <parameter key="attribute_name" value="artid"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="648" y="85">
        <parameter key="measure_types" value="NominalMeasures"/>
      </operator>
      <operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="782" y="85">
        <parameter key="table_type" value="matrix"/>
      </operator>
      <connect from_op="Retrieve t123_product" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
      <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
      <connect from_op="Similarity to Data" from_port="exampleSet" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

online360 · March 2016

Hi!

Many thanks for your input regarding the similarity to data-operator.
Unfortunately this exceeds my RAM (16 GB) even if I filter the data to 0.005.

Is there any chance to do this analysis without needing a 3-digit RAM server?

Thanks!

marcin_blachnik · March 2016

You should be able to do it on your computer, but I think there is a serious mistake in your process: you apply the distance measure to "langtext", which is nominal, so RM uses a nominal distance. A nominal distance only checks whether two nominal values are identical, so you can think of it as comparing two strings for equality. In other words, to get distance=0 you would need two identical text descriptions. What you should do is use the text mining extension to convert the text descriptions into numerical values; then you can run the Cross Distances operator with an appropriate numerical distance measure. You can also set top k to 3 to get the 3 most similar products.
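
The difference between the two kinds of measure can be sketched in a few lines of Python. This is only an illustration, not what RapidMiner does internally: the two product descriptions are made up, and plain word counts stand in for whatever vectors the text mining extension would produce.

```python
import math
from collections import Counter


def nominal_distance(a: str, b: str) -> int:
    """0 if the two strings are identical, 1 otherwise."""
    return 0 if a == b else 1


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of simple word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical descriptions that differ in a single word.
x = "red cotton shirt size M"
y = "red cotton shirt size L"

print(nominal_distance(x, y))             # strings differ, so maximal distance
print(round(cosine_similarity(x, y), 2))  # most words are shared
```

The nominal measure treats the two shirts as completely different, while the word-vector measure sees that four of five words match.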

The output of Cross Distances is exactly what you need: it returns three columns, namely the id of the product (request column), the id of the most similar product (document attribute) and the distance between these two products (a measure of how similar they are).
Some additional comments: in your scenario I wouldn't enable the compute similarities parameter, because you would then obtain the three most dissimilar products.
To analyse your results in the Results view, click on the request column so the results are sorted by it. You will then see that each product appears only 3 times (as you selected the top 3); the document column contains the most similar products and the last column is the distance.
I also suggest not using the Data2Similarity operator, as it is unnecessary and very memory-consuming. In the scenario above, the result of Cross Distances would take 250k * 3 (k) * 3 (number of columns) * 8 (size of a double) bytes, so about 18 MB of RAM to store your results. That is much less than 16 GB, but first you have to correct your process.
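
The estimate above can be checked with a short Python sketch. It only covers the storage of the result table, under the stated assumption that every value is an 8-byte double; the memory needed for the word vectors while the process runs is not included.

```python
n_products = 250_000
k = 3                # neighbours kept per product
n_columns = 3        # request, document, distance
bytes_per_value = 8  # assumption from the post: every value stored as a double

top_k_bytes = n_products * k * n_columns * bytes_per_value
full_matrix_bytes = n_products * n_products * bytes_per_value

print(f"top-k result: {top_k_bytes / 1e6:.0f} MB")
print(f"full matrix:  {full_matrix_bytes / 1e9:.0f} GB")
```

For comparison, a full 250k x 250k matrix under the same assumption would need about 500 GB.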

Best

online360 · March 2016

Hi Marcin!

By saying

You should be able to do it on your computer, but I think there is a serious mistake in your process: you apply the distance measure to "langtext", which is nominal, so RM uses a nominal distance. A nominal distance only checks whether two nominal values are identical, so you can think of it as comparing two strings for equality. In other words, to get distance=0 you would need two identical text descriptions. What you should do is use the text mining extension to convert the text descriptions into numerical values; then you can run the Cross Distances operator with an appropriate numerical distance measure. You can also set top k to 3 to get the 3 most similar products.

Do you mean something like in the following code?:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve t123_product" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Local Repository/data/t123_product"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.0.001" expanded="true" height="82" name="Split Data" width="90" x="112" y="289">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.05"/>
        </enumeration>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="7.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="187">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="bezeichnung_3|langtext"/>
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34"/>
      <operator activated="true" class="set_role" compatibility="7.0.001" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
        <parameter key="attribute_name" value="artid"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="cross_distances" compatibility="7.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="715" y="85">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="only_top_k" value="true"/>
        <parameter key="k" value="5"/>
      </operator>
      <connect from_op="Retrieve t123_product" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Set Role" from_port="original" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

This was the only version that ran for at least a few minutes before crying for more RAM.
But it also just stopped working (without any notice) after half an hour.

Thanks!

marcin_blachnik · March 2016

Your process is incorrect, because you can't use the Nominal2Numerical operator here. It would create 250k new attributes and you would run out of memory. Moreover, you wouldn't get what you expect. Please read the documentation of the Nominal2Numerical operator carefully.
As I wrote, you have to (!!!) use the text processing or text mining extension to convert your product text descriptions into meaningful numbers.
I suggest looking at the YouTube channel to learn about text mining in RM; after that you'll be able to solve your problem.
In your process you have to replace the Nominal2Numerical operator with the Process Documents from Data operator. Before that, you have to convert nominal to text, or re-read your data and set the correct attribute type for the longtext attribute.
If you run such a process for 250k documents it will take some time, but your computer should be able to handle it without any problems.
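
The attribute blow-up can be illustrated with a few lines of Python. This is a sketch of plain dummy (one-hot) coding, assuming that is the coding type in play; the three descriptions are made up.

```python
# Hypothetical product descriptions; each full text is one nominal value.
descriptions = [
    "red cotton shirt size M",
    "red cotton shirt size L",
    "blue wool scarf",
]

# Dummy coding: one new 0/1 column per distinct nominal value.
distinct_values = sorted(set(descriptions))
encoded = [[1 if d == v else 0 for v in distinct_values] for d in descriptions]

for row in encoded:
    print(row)
print(len(distinct_values), "columns for", len(descriptions), "rows")
```

With nearly unique descriptions the number of columns grows with the number of rows, and the two almost identical shirts still end up with vectors that share no non-zero entry, so the coding captures no similarity between them.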

online360 · March 2016

Hello!

Please excuse my late reply.

After watching a few videos and reading an article on another website, I managed to get results when using the "data to similarity" operator.
But when I use "cross distance", I only get "?" in the "distance" column of the results. (In particular, I don't know whether my connections are correct.)

Process using "cross distance" (distance column contains only "?"):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve t123_product_import_23032016" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//tech123_win/t123_product_import_23032016"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="description"/>
        <parameter key="attributes" value="sku|description"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.0.001" expanded="true" height="82" name="Split Data" width="90" x="313" y="34">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.01"/>
          <parameter key="ratio" value="0.99"/>
        </enumeration>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="289">
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85"/>
          <operator activated="true" class="text:transform_cases" compatibility="7.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="85"/>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.0.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="380" y="85"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="cross_distances" compatibility="7.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="447" y="289">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
        <parameter key="only_top_k" value="true"/>
        <parameter key="k" value="5"/>
      </operator>
      <connect from_op="Retrieve t123_product_import_23032016" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Nominal to Text" from_port="original" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Process using "data to similarity" (working):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve t123_product_import_23032016" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//tech123_win/t123_product_import_23032016"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="description"/>
        <parameter key="attributes" value="sku|description"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.0.001" expanded="true" height="82" name="Split Data" width="90" x="313" y="34">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.001"/>
          <parameter key="ratio" value="0.999"/>
        </enumeration>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="289">
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85"/>
          <operator activated="true" class="text:transform_cases" compatibility="7.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="85"/>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.0.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="380" y="85"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="447" y="340">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <connect from_op="Retrieve t123_product_import_23032016" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
      <connect from_op="Data to Similarity" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Does anyone have an idea why the "data to similarity" process works and the "cross distance" process does not?

Thank you very much!

online360 · March 2016

Hello!

After adding the "multiply" operator between "process documents" and "cross distance", I'm finally getting results! ;D
I don't know if this is the correct way to feed the ref port of "cross distance", but I hope so.

I now only have two questions left:

1.,
Can I make the result look like:
request;document
productX;productA,productB,productC

Instead of
request;document
productX;productA
productX;productB
productX; productC

2.,
What does a similarity score like 0.XX mean?
Isn't it possible to show only values between 1.0 and 0.0 (i.e. percentages)?
At the moment I get values between 0 and 1.55 (small sample set).

Thank you very much!

JEdward · March 2016

I now only have two questions left:

1.,
Can I make the result look like:
request;document
productX;productA,productB,productC

Yes, the Aggregate operator will do that for you. See this example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="7.0.001" expanded="true" height="103" name="Subprocess" width="90" x="112" y="85">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="179" y="30">
            <list key="attribute_values">
              <parameter key="attribute1" value="1"/>
              <parameter key="attribute2" value="2"/>
              <parameter key="attribute3" value="3"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="165">
            <list key="attribute_values">
              <parameter key="attribute1" value="1"/>
              <parameter key="attribute2" value="2"/>
              <parameter key="attribute3" value="3"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="179" y="238">
            <list key="attribute_values">
              <parameter key="attribute1" value="4"/>
              <parameter key="attribute2" value="5"/>
              <parameter key="attribute3" value="6"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="7.0.001" expanded="true" height="82" name="Generate ID" width="90" x="514" y="30">
            <parameter key="create_nominal_ids" value="true"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="179" y="289">
            <list key="attribute_values">
              <parameter key="attribute1" value="7"/>
              <parameter key="attribute2" value="8"/>
              <parameter key="attribute3" value="6"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (5)" width="90" x="179" y="340">
            <list key="attribute_values">
              <parameter key="attribute1" value="4"/>
              <parameter key="attribute2" value="8"/>
              <parameter key="attribute3" value="6"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (6)" width="90" x="179" y="442">
            <list key="attribute_values">
              <parameter key="attribute1" value="100"/>
              <parameter key="attribute2" value="5"/>
              <parameter key="attribute3" value="6"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (7)" width="90" x="179" y="544">
            <list key="attribute_values">
              <parameter key="attribute1" value="100"/>
              <parameter key="attribute2" value="100"/>
              <parameter key="attribute3" value="6"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.0.001" expanded="true" height="187" name="Append" width="90" x="313" y="210"/>
          <operator activated="true" class="generate_id" compatibility="7.0.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="514" y="210">
            <parameter key="create_nominal_ids" value="true"/>
          </operator>
          <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Generate ID" from_port="example set output" to_port="out 1"/>
          <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append" to_port="example set 3"/>
          <connect from_op="Generate Data by User Specification (5)" from_port="output" to_op="Append" to_port="example set 4"/>
          <connect from_op="Generate Data by User Specification (6)" from_port="output" to_op="Append" to_port="example set 5"/>
          <connect from_op="Generate Data by User Specification (7)" from_port="output" to_op="Append" to_port="example set 6"/>
          <connect from_op="Append" from_port="merged set" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_port="out 2"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="162"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="cross_distances" compatibility="7.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="313" y="85">
        <parameter key="numerical_measure" value="KernelEuclideanDistance"/>
        <parameter key="only_top_k" value="true"/>
        <parameter key="k" value="3"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.0.001" expanded="true" height="82" name="Aggregate" width="90" x="447" y="34">
        <list key="aggregation_attributes">
          <parameter key="document" value="concatenation"/>
        </list>
        <parameter key="group_by_attributes" value="request"/>
      </operator>
      <connect from_op="Subprocess" from_port="out 1" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Subprocess" from_port="out 2" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
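
For readers who want to see what the Cross Distances step with "only top k" conceptually produces before the Aggregate step, here is a minimal Python sketch. It is not RapidMiner code: the function names, the toy vectors and the choice of cosine similarity are assumptions made for illustration (the process above uses a different numerical measure), and skipping the self-match is a choice made in this sketch only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_pairs(vectors, k):
    """Return (request, document, similarity) rows: the k most similar other items per request."""
    rows = []
    for request_id, request_vec in vectors.items():
        scored = [
            (cosine_similarity(request_vec, doc_vec), doc_id)
            for doc_id, doc_vec in vectors.items()
            if doc_id != request_id  # skip the trivial self-match (a choice of this sketch)
        ]
        # Highest similarity first; ties are broken by id to keep the order stable.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        rows.extend((request_id, doc_id, sim) for sim, doc_id in scored[:k])
    return rows

# Hypothetical toy word vectors keyed by artid.
vectors = {
    "A1": [1.0, 0.0, 1.0],
    "A2": [1.0, 0.0, 0.9],
    "A3": [0.0, 1.0, 0.0],
}
pairs = top_k_pairs(vectors, k=1)
```

With k fixed, the result has k rows per product instead of one row per pair of products, which is why limiting to the top k keeps the output manageable even for large inputs. The pairwise comparison itself is still quadratic in this sketch.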

2.
What do similarity scores like 0.XX mean?
Isn't it possible to show values between 0.0 and 1.0 (i.e. percentages)?
At the moment I get values between 0 and 1.55 (small sample set).

I'm not 100% sure I understand your example here, but it sounds like Normalize is the operator you are looking for.
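
To make the idea of normalizing concrete, here is a minimal Python sketch of a min-max (range) rescaling that maps values such as 0 to 1.55 onto 0.0 to 1.0. The function name and the sample distances are hypothetical. As far as I know, the Normalize operator offers a range transformation method for this, while its default is a Z-transformation, so the method parameter is worth checking.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to spread them over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical distances in the range reported above (0 to 1.55).
distances = [0.0, 0.31, 0.775, 1.55]
scaled = min_max_scale(distances)

# For a distance, smaller means more similar, so one way to read the
# scaled value as a similarity is to flip it.
similarity = [1.0 - s for s in scaled]
```

Note that the rescaled value depends on the smallest and largest distance in the sample, so it is a relative score rather than a true percentage.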

online360 · March 2016

Hi!

Number 2 seems to work, thanks!

For problem 1:
Setting "aggregation attributes" to "document" and "concatenation" doesn't work; it says:
"The value type of the attribute is not compatible with the aggregation function "concatenation"."
Is there another aggregation function that works, or is there another operator that has to be put in front of "Aggregate"?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve t123_product_import_23032016" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/data/t123_product_import_23032016"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="description"/>
        <parameter key="attributes" value="sku|description|etim|manufacturer|teg_prodnumber|short_description"/>
      </operator>
      <operator activated="true" class="trim" compatibility="7.0.001" expanded="true" height="82" name="Trim" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="etim"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="etim.equals.EC000374"/>
        </list>
      </operator>
      <operator activated="false" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="782" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="manufacturer.equals.Gira"/>
        </list>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="238"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="391">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="7.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="85"/>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.0.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="447" y="85"/>
          <operator activated="true" class="text:stem_snowball" compatibility="7.0.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="581" y="85">
            <parameter key="language" value="German"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.0.001" expanded="true" height="103" name="Multiply" width="90" x="447" y="391"/>
      <operator activated="true" class="cross_distances" compatibility="7.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="648" y="391">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
        <parameter key="only_top_k" value="true"/>
      </operator>
      <operator activated="true" class="normalize" compatibility="7.0.001" expanded="true" height="103" name="Normalize" width="90" x="849" y="238">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="distance"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.0.001" expanded="true" height="82" name="Aggregate" width="90" x="1117" y="238">
        <list key="aggregation_attributes">
          <parameter key="document" value="concatenation"/>
        </list>
        <parameter key="group_by_attributes" value="request"/>
      </operator>
      <connect from_op="Retrieve t123_product_import_23032016" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Trim" to_port="example set input"/>
      <connect from_op="Trim" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
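
The error message above points to a type mismatch: concatenation is a text operation, while the "document" column holds numbers. A minimal Python sketch of the intended result (one row per product, similar ids joined into one field, with a similarity limit) looks like this. The function name, the sample rows and the 0.5 threshold are hypothetical. In RapidMiner the equivalent would presumably be to convert "document" to a nominal type before Aggregate (for example with Numerical to Polynominal), but that is an assumption not confirmed in this thread.

```python
from collections import defaultdict

def similar_ids_per_request(rows, min_similarity):
    """Group (request, document, similarity) rows into one string of similar ids per request."""
    grouped = defaultdict(list)
    for request, document, similarity in rows:
        if similarity >= min_similarity:
            # Cast to str first: joining only works on text, not on numbers.
            grouped[request].append(str(document))
    return {request: ";".join(docs) for request, docs in grouped.items()}

# Hypothetical pair rows: request artid, numeric document id, similarity score.
rows = [
    ("A1", 2, 0.93),
    ("A1", 3, 0.41),
    ("A2", 1, 0.93),
    ("A2", 3, 0.88),
]
result = similar_ids_per_request(rows, min_similarity=0.5)
```

Products with no match above the limit simply do not appear in the result, which corresponds to the "limit of similarity" asked for in the first post.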
