Adding a column that performs a distinct count
darrenvermaak
Hi all,
Very new to RapidMiner!
I need to add a column to my table that performs a distinct count.
This is a sample of my table currently:
CaseNumber | Date |
1A | 1/1/2015 |
1A | 1/2/2015 |
1A | 1/2/2015 |
2A | 2/1/2015 |
3A | 3/1/2015 |
2A | 4/1/2015 |
The DistinctCount column would perform a distinct count on CaseNumber, marking the first row of each case number with 1 and any repeats with 0, so:
CaseNumber | Date | DistinctCount |
1A | 1/1/2015 | 1 |
1A | 1/2/2015 | 0 |
1A | 1/2/2015 | 0 |
2A | 2/1/2015 | 1 |
2A | 4/1/2015 | 0 |
3A | 3/1/2015 | 1 |
Basically I just want to count how many unique case numbers there are.
So casenumber 1A occurs 3 times, but it's still just the 1 case number.
Same thing for casenumber 2A, it occurs twice, but still just the 1 case number.
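For illustration only, the requested column can be sketched outside RapidMiner in plain Python: the first row seen for each case number gets a 1 and every later row gets a 0, so the sum of the column equals the number of unique case numbers. The function name `add_distinct_flag` and the list-of-tuples layout are assumptions made for this sketch, not RapidMiner operators.

```python
def add_distinct_flag(rows):
    """Append 1 to the first row seen for each case number, 0 to repeats."""
    seen = set()
    flagged = []
    for case_number, date in rows:
        is_first = case_number not in seen
        seen.add(case_number)
        flagged.append((case_number, date, 1 if is_first else 0))
    return flagged


# Sample rows from the question, kept in their original order.
rows = [
    ("1A", "1/1/2015"),
    ("1A", "1/2/2015"),
    ("1A", "1/2/2015"),
    ("2A", "2/1/2015"),
    ("3A", "3/1/2015"),
    ("2A", "4/1/2015"),
]

flagged = add_distinct_flag(rows)

# Summing the flag column gives the number of unique case numbers.
distinct_cases = sum(flag for _, _, flag in flagged)

for row in flagged:
    print(" | ".join(str(value) for value in row))
print("Distinct case numbers:", distinct_cases)
```

All six rows are kept; only the extra column is added.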
Answers
Hi,
That's pretty simple to do if you use the Aggregate operator and select "Only Distinct".
Here's a sample process attached.
Hi Thomas,
Thanks for the reply.
I might be missing something here, but I don't want just the distinct count. I want to keep all my existing data and add a new column that performs a distinct count on my CaseNumber field.
When I run the Aggregate operator and select "Only Distinct", I'm given just the distinct count and nothing else.
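The difference described here can be sketched in plain Python, as a rough analogy rather than a reproduction of the Aggregate operator: a distinct-count aggregation collapses the whole table into a single value, so the dates and any other columns are no longer part of the result. The helper name `distinct_count` is hypothetical.

```python
def distinct_count(rows, key_index=0):
    """Return how many distinct values appear in one column of the table."""
    return len({row[key_index] for row in rows})


# Sample rows from the question: (CaseNumber, Date).
rows = [
    ("1A", "1/1/2015"),
    ("1A", "1/2/2015"),
    ("1A", "1/2/2015"),
    ("2A", "2/1/2015"),
    ("3A", "3/1/2015"),
    ("2A", "4/1/2015"),
]

# The aggregation yields one number; the six original rows are not in it.
total_distinct = distinct_count(rows)
print("Rows in the table:", len(rows))
print("Distinct case numbers:", total_distinct)
```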
Why not just use the "Remove Duplicates" operator? Then your dataset will only contain the non-duplicated entries, and the total record count will equal the distinct record count. You can specify the fields that define a unique record in that operator, so you can use both case number and date or any other combination of attributes.
Best,
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
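As a rough sketch of the idea in the reply above (not the operator's actual implementation), deduplicating on a chosen set of fields can be written in plain Python. The function name `remove_duplicates` and the dictionary row layout are assumptions for this example. Note how the choice of key fields changes the number of rows that survive.

```python
def remove_duplicates(rows, key_fields):
    """Keep only the first row for each combination of the given key fields."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Sample rows from the question.
rows = [
    {"CaseNumber": "1A", "Date": "1/1/2015"},
    {"CaseNumber": "1A", "Date": "1/2/2015"},
    {"CaseNumber": "1A", "Date": "1/2/2015"},
    {"CaseNumber": "2A", "Date": "2/1/2015"},
    {"CaseNumber": "3A", "Date": "3/1/2015"},
    {"CaseNumber": "2A", "Date": "4/1/2015"},
]

# Unique on case number alone: one row per case.
by_case = remove_duplicates(rows, ["CaseNumber"])

# Unique on case number and date: only exact repeats of both are dropped.
by_case_and_date = remove_duplicates(rows, ["CaseNumber", "Date"])

print("Rows unique by CaseNumber:", len(by_case))
print("Rows unique by CaseNumber and Date:", len(by_case_and_date))
```

Either way, rows are discarded, so this only fits when the other columns on the dropped rows are not needed.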
There are a number of additional fields in this dataset that I need.
In this dataset, "casenumber" refers to a specific survey of about 50 questions, so there will be duplicate casenumbers because each casenumber has a row for every question.
Unfortunately, removing duplicates is not an option.
I found a way to do this: