filter all duplicate examples
hi
I'm a newbie in RapidMiner. I want to filter all the examples that have duplicate values. I use the process below, but if a name appears 5 times, the result shows only 4 of them. How can I filter all 5 and still keep the other attributes in my result?
Best Answer
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Simple: this is just the complement of what I already posted. Change the Filter Examples condition to count > 1 rather than count = 1 and you will get ONLY the duplicates. I thought you did NOT want the duplicates.
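For anyone following along outside RapidMiner, here is a minimal Python sketch of this complement variant (this is not RapidMiner code; the records and the "name" key are made-up examples standing in for whatever attribute defines a duplicate):

```python
from collections import Counter

# Hypothetical sample records; "name" plays the role of the duplicate key.
records = [
    {"name": "alice", "comment_id": 1},
    {"name": "bob",   "comment_id": 2},
    {"name": "alice", "comment_id": 3},
]

# Aggregate step: count how often each key appears.
counts = Counter(r["name"] for r in records)

# Filter with count > 1: keep ONLY the duplicated authors,
# with every original attribute intact.
duplicates = [r for r in records if counts[r["name"]] > 1]
# duplicates holds both of alice's rows.
```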
Answers
Hi @neginz,
Can you share your dataset and your process?
Can you also explain, with an example, what you get now and what you want to obtain?
Regards,
Lionel
If I understand you correctly, you want to eliminate any records that have duplicates. Here's a simple technique I have used to do this in the past. First, use Aggregate to group by name (or whatever constitutes the unique key that defines a duplicate; note this can be more than one field) and count of name, which will give you a count of how many times each name appears. Filter Examples on that set for any record that has a count of exactly one, and then Join (using Inner Join) back to the original dataset. Presto---you should then have only the records that appeared once!
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
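The Aggregate / Filter Examples / Join recipe above can be sketched in plain Python like this (not RapidMiner code; the records and the "name" key are illustrative assumptions):

```python
from collections import Counter

# Hypothetical sample records; "name" plays the role of the duplicate key.
records = [
    {"name": "alice", "comment_id": 1},
    {"name": "bob",   "comment_id": 2},
    {"name": "alice", "comment_id": 3},
    {"name": "carol", "comment_id": 4},
]

# Step 1 (Aggregate): count how often each key appears.
counts = Counter(r["name"] for r in records)

# Step 2 (Filter Examples): keep only keys with a count of exactly one.
unique_keys = {k for k, c in counts.items() if c == 1}

# Step 3 (Join, inner): keep original records whose key survived the filter,
# so all original attributes are preserved.
result = [r for r in records if r["name"] in unique_keys]
# result now contains only bob's and carol's rows.
```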
hi @lionelderkrikor
My data are customers' comments, and I want to extract the rows of authors who commented more than once. In my process, when we have 2 rows with the same author, the result shows only one of them (when the absolute count in the pic = 2). I think it's because of the Remove Duplicates operator: it removes only the duplicate values, not all of the values that have duplicates. Actually, one of them remains and is not removed.
screenshot of data
hi @Telcontar120
Thanks for the help. I tried that before without the joining part, and the result only has 2 attributes, one of which is the count. Could you please explain more about the joining part?
result without inner join operator
If you post a small data sample, it would be easier to help you.
Basically you want to take the output you are showing, but filter it for those records that only have a count of 1.
Then you will use that to join back to the original full dataset that has all the duplicates, but the inner join will only keep the records that have a count of one.
@Telcontar120
Sorry, but how can I post Excel data here? It gives an error for the file extension, even when I use .rar.
Just post it as csv or txt
Thanks, and sorry @Telcontar120.
It's a small sample of my data. I want my result to have the "comment id" attribute.
sorry for my English
Here is a process that does what you describe in your original post. It removes posts from authors who have more than one comment (i.e., it removes all items included in duplicate sets by author). You should be able to adapt this to your needs very easily. Of course, the first operator will need the path to your data file modified.
@Telcontar120 thanks for your help, but that is not what I wanted. I need the result to look like the picture below. I hope it's clear now.
@Telcontar120
Yes, it works, thanks a lot for your help :smileyvery-happy:. My mistake was that I counted the author instead of counting the comment id.
Hi @Telcontar120,
I will be strict:
I expect an Ambassador and beta tester of RM 9 to accomplish this task with the new "Turbo Prep" tool: it is feasible!
Dataset :
Result :
.....I'm joking of course !!!.....:catwink::catlol:
Have a nice day and happy experimentations,
Regards,
Lionel