Compare Examples within an ExampleSet
Hello everyone,
The ExampleSet looks like:
Row No. Att1 Att2 Att3
1 A B C
2 A B C
3 A B C
4 D E F
5 D E F
6 A B C
7 D E F
8 D E F
What I want to do is compare each example with the one in the first row and check whether they match. If they do, the Result attribute should show the same value (here "1" for rows 1, 2 and 3). This continues until the comparison fails for the first time (here after row 3). The process then starts again, but this time the "first row" is the one that failed the previous comparison (in this case row 4). The following examples are compared with this new "first row" (e.g. row 5 with row 4, row 6 with row 4, ... until the next mismatch occurs). This time the Result attribute should show "2".
And so on.
It is important not to change the order of the examples, because I need to know how often there is a difference within the ExampleSet.
This is how it should look in the end:
Row No. Att1 Att2 Att3 Result
1 A B C 1
2 A B C 1
3 A B C 1
4 D E F 2
5 D E F 2
6 A B C 3
7 D E F 4
8 D E F 4
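For reference, the numbering described above can be sketched outside RapidMiner in a few lines of plain Python. This is only an illustration of the intended logic, not a RapidMiner process; the function name `number_runs` and the tuple representation of the rows are assumptions made for the example.

```python
def number_runs(rows):
    """Label consecutive runs of matching rows 1, 2, 3, ... without reordering."""
    labels = []
    anchor = None  # the current "first row" that later rows are compared with
    run_id = 0
    for row in rows:
        if run_id == 0 or row != anchor:
            # First row overall, or first mismatch: this row becomes the new anchor
            anchor = row
            run_id += 1
        labels.append(run_id)
    return labels


# The example rows from this post, in their original order
rows = [
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("D", "E", "F"),
    ("D", "E", "F"),
    ("A", "B", "C"),
    ("D", "E", "F"),
    ("D", "E", "F"),
]
print(number_runs(rows))  # [1, 1, 1, 2, 2, 3, 4, 4]
```

The last label equals the number of runs, so the number of differences within the set is that value minus one.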
I tried to solve the problem with the Loop Examples and Generate Attributes operators, but it didn't really work.
Does anybody have an idea? I have no clue.
Many thanks and best regards,
Leo
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Dear all,
@Thomas_Ott: Nice suggestion!
Lag Series was indeed the "key operator" for this last task. Thanks for your help.
@Leo_179, here is the new process to apply to your whole dataset, to see whether it gives relevant results:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.2.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="187">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Compare_Examplesets\Compare_Examplesets.xlsx"/>
<parameter key="imported_cell_range" value="A1:C9"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Att1.true.polynominal.attribute"/>
<parameter key="1" value="Att2.true.polynominal.attribute"/>
<parameter key="2" value="Att3.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="124" name="Multiply (3)" width="90" x="179" y="187"/>
<operator activated="true" class="loop_examples" compatibility="8.2.000" expanded="true" height="103" name="Loop Examples" width="90" x="313" y="136">
<process expanded="true">
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="124" name="Multiply" width="90" x="45" y="34"/>
<operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range (3)" width="90" x="45" y="238">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="1"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
<parameter key="first_example" value="%{example}"/>
<parameter key="last_example" value="%{example}"/>
</operator>
<operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="103" name="Append" width="90" x="179" y="136"/>
<operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range (2)" width="90" x="380" y="136">
<parameter key="first_example" value="%{example}"/>
<parameter key="last_example" value="%{example}"/>
</operator>
<operator activated="true" class="cross_distances" compatibility="8.2.000" expanded="true" height="103" name="Cross Distances" width="90" x="514" y="85"/>
<connect from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Append" to_port="example set 2"/>
<connect from_op="Multiply" from_port="output 3" to_op="Filter Example Range (3)" to_port="example set input"/>
<connect from_op="Filter Example Range (3)" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Append" from_port="merged set" to_op="Filter Example Range (2)" to_port="example set input"/>
<connect from_op="Filter Example Range (2)" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="output 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="82" name="Append (2)" width="90" x="447" y="187"/>
<operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID" width="90" x="581" y="187"/>
<operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (2)" width="90" x="313" y="289"/>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join" width="90" x="715" y="238">
<parameter key="remove_double_attributes" value="false"/>
<list key="key_attributes"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="849" y="187">
<list key="function_descriptions">
<parameter key="Result" value="if(round([distance],3)==1.732,1,0)"/>
</list>
</operator>
<operator activated="true" class="loop_examples" compatibility="8.2.000" expanded="true" height="103" name="Loop Examples (2)" width="90" x="983" y="187">
<process expanded="true">
<operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Result"/>
</operator>
<operator activated="true" class="series:lag_series" compatibility="7.4.000" expanded="true" height="82" name="Lag Series" width="90" x="380" y="85">
<list key="attributes">
<parameter key="Result" value="%{example}"/>
</list>
</operator>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (2)" width="90" x="514" y="85">
<list key="key_attributes"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="648" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Result-%{example}"/>
<parameter key="attributes" value="id"/>
</operator>
<operator activated="true" class="transpose" compatibility="8.2.000" expanded="true" height="82" name="Transpose" width="90" x="782" y="85"/>
<connect from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Lag Series" to_port="example set input"/>
<connect from_op="Lag Series" from_port="example set output" to_op="Join (2)" to_port="right"/>
<connect from_op="Lag Series" from_port="original" to_op="Join (2)" to_port="left"/>
<connect from_op="Join (2)" from_port="join" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
<connect from_op="Transpose" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (3)" width="90" x="1117" y="136"/>
<operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="82" name="Append (3)" width="90" x="1117" y="238"/>
<operator activated="true" class="transpose" compatibility="8.2.000" expanded="true" height="82" name="Transpose (2)" width="90" x="1251" y="238"/>
<operator activated="true" class="concurrency:loop_attributes" compatibility="8.2.000" expanded="true" height="82" name="Loop Attributes" width="90" x="1385" y="238">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="numeric"/>
<parameter key="except_value_type" value="attribute_value"/>
<process expanded="true">
<operator activated="true" class="aggregate" compatibility="8.2.000" expanded="true" height="82" name="Aggregate" width="90" x="380" y="34">
<list key="aggregation_attributes">
<parameter key="%{loop_attribute}" value="sum"/>
</list>
</operator>
<operator activated="true" class="transpose" compatibility="8.2.000" expanded="true" height="82" name="Transpose (3)" width="90" x="581" y="34"/>
<connect from_port="input 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Transpose (3)" to_port="example set input"/>
<connect from_op="Transpose (3)" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="82" name="Append (4)" width="90" x="1519" y="238"/>
<operator activated="true" class="sort" compatibility="8.2.000" expanded="true" height="82" name="Sort" width="90" x="1653" y="238">
<parameter key="attribute_name" value="id"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="45" y="34">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    # running total: Result_2[j] = Result[j] + Result_2[j-1]&#10;    data['Result_2'] = data['Result'].cumsum()&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (5)" width="90" x="1787" y="238"/>
<operator activated="true" class="concurrency:join" compatibility="8.2.000" expanded="true" height="82" name="Join (3)" width="90" x="1921" y="187">
<list key="key_attributes"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="2122" y="187">
<list key="function_descriptions">
<parameter key="Final_Result" value="Result+ att_1"/>
</list>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Multiply (3)" to_port="input"/>
<connect from_op="Multiply (3)" from_port="output 1" to_op="Loop Examples" to_port="example set"/>
<connect from_op="Multiply (3)" from_port="output 2" to_op="Generate ID (2)" to_port="example set input"/>
<connect from_op="Loop Examples" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
<connect from_op="Append (2)" from_port="merged set" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Generate ID (2)" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Loop Examples (2)" to_port="example set"/>
<connect from_op="Loop Examples (2)" from_port="example set" to_op="Generate ID (3)" to_port="example set input"/>
<connect from_op="Loop Examples (2)" from_port="output 1" to_op="Append (3)" to_port="example set 1"/>
<connect from_op="Generate ID (3)" from_port="example set output" to_op="Join (3)" to_port="left"/>
<connect from_op="Append (3)" from_port="merged set" to_op="Transpose (2)" to_port="example set input"/>
<connect from_op="Transpose (2)" from_port="example set output" to_op="Loop Attributes" to_port="input 1"/>
<connect from_op="Loop Attributes" from_port="output 1" to_op="Append (4)" to_port="example set 1"/>
<connect from_op="Append (4)" from_port="merged set" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Generate ID (5)" to_port="example set input"/>
<connect from_op="Generate ID (5)" from_port="example set output" to_op="Join (3)" to_port="right"/>
<connect from_op="Join (3)" from_port="join" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I hope it helps,
Regards,
Lionel
Answers
@Leo_179 Try using the Aggregate operator for this. Aggregate by all your attributes and then, under the grouping, use the sum method.
Hi Thomas,
Thanks for your fast answer!
Since I'm new to RapidMiner, could you please explain your solution a little more? I'm not quite sure how to use the Aggregate operator in this case. Which other operators do I need to solve the problem?
Best regards,
Leo
Hi @Leo_179,
I was not able to build a process using only RapidMiner operators, so, with great disappointment, I used a Python script for the last part of the process (explained below).
To run this process, you must install a Python environment on your computer and install the Execute Python operator (from the Marketplace).
Here is the process:
Indeed, with RapidMiner I am able to compute the distance between consecutive examples (between example[i] and example[i-1]) and obtain this:
However, I am not able to perform the very simple last operation with RapidMiner, which consists of:
- creating an attribute 'Total' initialized to 0
- iterating to sum: Total[i] = Total[i-1] + Result[i]
and finally obtaining this:
So if someone has an idea for performing this last operation with RapidMiner, I am very curious to hear it
(and, more generally, for solving this problem using only RapidMiner, without a script).
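For reference, the running total described in the two steps above is a cumulative sum. Here is a minimal sketch in plain Python, assuming the change flags are already available as a list. The flag values below are hypothetical, chosen to match the eight-row example in this thread, and the variable names are for illustration only.

```python
from itertools import accumulate

# Hypothetical change flags: 1 where a row differs from the row before it, else 0
result = [0, 0, 0, 1, 0, 1, 1, 0]

# Total[i] = Total[i-1] + Result[i], with the total starting at 0
total = list(accumulate(result))

# Adding 1 turns the running total into run numbers that start at 1
final_result = [t + 1 for t in total]

print(total)         # [0, 0, 0, 1, 1, 2, 3, 3]
print(final_result)  # [1, 1, 1, 2, 2, 3, 4, 4]
```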
Still, I hope it helps,
Regards,
Lionel
@lionelderkrikor will the Lag operator from the Series extension help?
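To illustrate the lag idea outside RapidMiner: shifting the rows down by one position places each row next to its predecessor, so the comparison becomes a simple row-wise check. This is a rough sketch in plain Python, not the Lag Series operator itself; the function name `change_flags` and the tuple rows are assumptions made for the example.

```python
def change_flags(rows):
    """Return 1 where a row differs from the previous row (lag 1), else 0."""
    lagged = [None] + rows[:-1]  # each row's predecessor; the first row has none
    return [
        0 if prev is None or row == prev else 1
        for row, prev in zip(rows, lagged)
    ]


rows = [
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("D", "E", "F"),
    ("D", "E", "F"),
    ("A", "B", "C"),
    ("D", "E", "F"),
    ("D", "E", "F"),
]
print(change_flags(rows))  # [0, 0, 0, 1, 0, 1, 1, 0]
```

A running total of these flags, plus 1, gives the run number for each row.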
Dear all,
Thank you very much for your help! I'm now using the "Lag Series" operator and it works quite well.
Best regards,
Leo