Adding a column that performs a distinct count

darrenvermaak · September 2016

Hi all,

Very new to Rapidminer!

I am needing to add a column into my table that runs a distinct count.

This is a sample of my table currently:

CaseNumber	Date
1A	1/1/2015
1A	1/2/2015
1A	1/2/2015
2A	2/1/2015
3A	3/1/2015
2A	4/1/2015

The Distinct Count would perform a distinct count on Case Number so:

CaseNumber	Date	DistinctCount
1A	1/1/2015	1
1A	1/2/2015	0
1A	1/2/2015	0
2A	2/1/2015	1
2A	3/1/2015	0
3A	4/1/2015	1

Basically I just want to count how many unique case numbers there are.

So casenumber 1A occurs 3 times, but it's still just the 1 case number.

Same thing for casenumber 2A, it occurs twice, but still just the 1 case number.

Thomas_Ott · September 2016

Hi,

That's pretty simple to do if you use the Aggregate Operator and select "Only Distinct"

Here's a sample process attached.

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Products" width="90" x="246" y="85">
        <parameter key="repository_entry" value="//Samples/data/Products"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="447" y="85">
        <list key="aggregation_attributes">
          <parameter key="Product Name" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="Product Name"/>
        <parameter key="only_distinct" value="true"/>
      </operator>
      <connect from_op="Retrieve Products" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

darrenvermaak · September 2016

Hi Thomas,

Thanks for the reply.

I might missing something here, but I don't want just the distinct count. I want to keep all my existing data and add a new column that performs a distinct count on my CaseNumber field.

When I run the Aggregate Operator and select "Only Distinct", I'm give just the distinct count and nothing else.

Telcontar120 · September 2016

Why not just use the "Remove Duplicate" operator? Then your dataset will only contain the non-duplicated entries, and the total record count will equal the distinct record count. You can specify the fields that define a unique record in that operator, so you can use both case number and date or any other combination of attributes.

Best,

darrenvermaak · September 2016

There are a number of additional fields in this dataset that I need.

In this dataset, "casenumber" refers to a specific survey of about 50 questions, therefore there will be duplicate casenumbers because there are a number of different questions for each casenumber.

Unfortunatley removing duplicates is not an option.

darrenvermaak · September 2016

I found a way to do this:

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Adding a column that performs a distinct count

Answers