Identify Duplicate examples

aliasgarscool · September 2016

Hi,

I have a data set in which I want to identify the duplicates (unlike Remove Duplicates, I want to keep only the duplicated examples).

For example, I have the data below:

Month   Name  Amount
Jul-15  John  10$
Aug-15  Alex  15$
Sep-15  John  5$
Jul-15  John  10$

If the above table is my input, then I want only the row below in my results:

Month   Name  Amount
Jul-15  John  10$

dr-connie-brett · September 2016

If you don't actually need the duplicated examples themselves, but rather the count of how many times each one appears, this is how I would handle it:

1 - Aggregate the table (Aggregate operator: group by all attributes and count on one of them)

2 - Filter examples to keep only those with count(attribute) > 1

[Attached screenshot: Screen Shot 2016-09-25 at 9.59.00 AM.png]

I'm assuming that, since there is no unique identifier to ignore, you don't really need each duplicate repeated as many times as it appears, but it might be useful to know how many times each one appears!
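For readers who want to check the logic outside RapidMiner, the two steps above can be sketched in plain Python. This is only an illustration of the group-and-count idea, not a RapidMiner process; the function name `duplicate_counts` and the tuple layout `(Month, Name, Amount)` are assumptions made for the example.

```python
from collections import Counter

# Sample rows from the question, as (Month, Name, Amount) tuples.
rows = [
    ("Jul-15", "John", "10$"),
    ("Aug-15", "Alex", "15$"),
    ("Sep-15", "John", "5$"),
    ("Jul-15", "John", "10$"),
]


def duplicate_counts(rows):
    """Return {row: count} for every row that occurs more than once."""
    # Step 1: group by all attributes and count each group.
    counts = Counter(rows)
    # Step 2: keep only the groups whose count is greater than 1.
    return {row: n for row, n in counts.items() if n > 1}


print(duplicate_counts(rows))
# {('Jul-15', 'John', '10$'): 2}
```

Each duplicated row shows up once in the result, together with the number of times it occurred in the input.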

sgenzer · September 2016

Hi... that was a good puzzle. I would do it this way:

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_id" compatibility="7.2.002" expanded="true" height="82" name="Generate ID" width="90" x="179" y="136"/>
      <operator activated="true" class="multiply" compatibility="7.2.002" expanded="true" height="103" name="Multiply" width="90" x="313" y="136"/>
      <operator activated="true" class="remove_duplicates" compatibility="7.2.002" expanded="true" height="82" name="Remove Duplicates" width="90" x="514" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Amount|Month|Name"/>
      </operator>
      <operator activated="true" class="set_minus" compatibility="7.2.002" expanded="true" height="82" name="Set Minus" width="90" x="715" y="136"/>
      <connect from_port="input 1" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Set Minus" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
      <connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Scott
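As a rough illustration of what the process above appears to do, here is a plain-Python sketch of the same three stages: assign an ID to every row, drop duplicates on Month, Name and Amount, then subtract the de-duplicated set from the full set by ID. It assumes that Remove Duplicates keeps the first occurrence of each group and that Set Minus matches examples on the generated ID; the function name `extra_copies` is hypothetical and not part of RapidMiner.

```python
# Sample rows from the question, as (Month, Name, Amount) tuples.
rows = [
    ("Jul-15", "John", "10$"),
    ("Aug-15", "Alex", "15$"),
    ("Sep-15", "John", "5$"),
    ("Jul-15", "John", "10$"),
]


def extra_copies(rows):
    """Return (id, row) pairs for every row beyond its first occurrence."""
    # "Generate ID": number the rows starting at 1.
    with_ids = list(enumerate(rows, start=1))

    # "Remove Duplicates": remember the ID of the first row in each group
    # (assumption: the first occurrence is the one that is kept).
    seen_rows = set()
    kept_ids = set()
    for row_id, row in with_ids:
        if row not in seen_rows:
            seen_rows.add(row)
            kept_ids.add(row_id)

    # "Set Minus": all rows minus the kept ones, matched on ID.
    return [(row_id, row) for row_id, row in with_ids if row_id not in kept_ids]


print(extra_copies(rows))
# [(4, ('Jul-15', 'John', '10$'))]
```

In this sketch a row that occurs three times would come back twice, since only its first copy is subtracted; a further de-duplication step on the result would reduce that to one row per duplicated group.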

Altair RapidMiner Community