The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
[SOLVED] Remove duplicates selecting which examples must remain
arthurgouveia
Member Posts: 5 Contributor II
Hello.
I have a large customer dataset with some values "duplicate". Let me try to make myself clear: I have a dataset with over 50 attributes of over 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.
I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.
How can I do it?
Thanks
I have a large customer dataset with some values "duplicate". Let me try to make myself clear: I have a dataset with over 50 attributes of over 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.
I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.
How can I do it?
Thanks
0
Answers
you can use the RapidMiner operator called Filter Examples.
Cheers,
Ralf
in addition to Filter Examples for filtering on is_valid_status you can use the Remove Duplicates operator. It allows to select which attributes are considered for finding duplicates. You should try the operators first on a smaller subset to get a feeling for their settings.
Best regards,
Marius
But now I have another problem. I found several contracts with valid statuses and I'd like to remove duplicates but keep only the most recent status. I have an attribute named date_processing that I can use to achieve that but I can't figure how.
Is there any way to remove duplicates keeping only the most recent data?
this requires several steps:
- You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).
- The operator Rename can be used to rename the new attribute max(date_processing) to max_date.
- With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).
- With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.
Cheers,Ralf
What is amazing is that when I look to the meta data view the max_date attribute is type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date, Guess Types but all of them either don't list max_date at the attribute list or don't make Filter Example work.
It's working! Thanks!