"Compare two customer databases"
xiaobo_sxb
Member Posts: 17 Contributor II
Hi
I have two customer tables that contain information such as name, address, and phone number. Most of the customers appear in both tables. I'd like to map the records that refer to the same customer by comparing these fields. Each table has more than 10K records. Does anybody know how to do that in RapidMiner?
Best Regards
Steven
Answers
The Join operator lets you join tables together. You could also use a distance or similarity approach (for example, Data to Similarity) to see which records are close to one another.
Regards,
Andrew
Thank you for your reply. I still have questions about your proposal.
First, the Join operator requires the two datasets to share the same ID (the key). In my case, they don't have a common ID.
The "Data to Similarity" operator is still not good enough. First, it creates a cross join across all the rows; I have more than 10K rows in each table, so I doubt the performance will be acceptable. Second, even if I have the similarity score, I don't know what threshold to use to decide that two rows are the same customer. Is it possible to generate a probability, that is, a percentage of confidence that two customers are actually the same one?
Regards
Steven
Well, if there is no common ID, then there is obviously no way to use Join.
Actually, a better operator would be Cross Distances, which lets you keep only the top k nearest records for each row. The threshold depends entirely on the data you have, so I can't answer that.
Performance may not be that bad; you'll have to try it.
regards
Andrew
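
For readers who want to see the top-k idea outside RapidMiner, here is a minimal Python sketch. It is not the Cross Distances operator itself: it assumes hypothetical field names (name, address, phone), swaps in the standard library's difflib.SequenceMatcher as the similarity measure, and still compares every pair of rows, so it only illustrates how keeping the k best candidates per record avoids materializing the full cross join.

```python
from difflib import SequenceMatcher
import heapq

# Hypothetical field names; adjust to the columns in your own tables.
FIELDS = ("name", "address", "phone")

def normalize(record):
    """Join the compared fields into one lowercase string."""
    return " ".join(str(record[f]).strip().lower() for f in FIELDS)

def similarity(rec_a, rec_b):
    """String similarity in [0, 1]; a score, not a probability."""
    return SequenceMatcher(None, normalize(rec_a), normalize(rec_b)).ratio()

def top_k_matches(table_a, table_b, k=1):
    """For each row of table_a, keep the k most similar rows of table_b.

    Returns {index_in_a: [(score, index_in_b), ...]}, best score first.
    """
    matches = {}
    for i, rec_a in enumerate(table_a):
        scored = ((similarity(rec_a, rec_b), j) for j, rec_b in enumerate(table_b))
        matches[i] = heapq.nlargest(k, scored)
    return matches

# Made-up example data.
table_a = [
    {"name": "John Smith", "address": "12 Main St", "phone": "555-0101"},
    {"name": "Mary Jones", "address": "4 Oak Ave", "phone": "555-0199"},
]
table_b = [
    {"name": "Mary Jones", "address": "4 Oak Avenue", "phone": "555-0199"},
    {"name": "Jon Smith", "address": "12 Main Street", "phone": "555-0101"},
]

best = top_k_matches(table_a, table_b, k=1)
for i, candidates in best.items():
    for score, j in candidates:
        print(table_a[i]["name"], "->", table_b[j]["name"], round(score, 2))
```

The score is only a similarity, so the cut-off still has to be chosen from the data, for example by hand-checking a sample of candidate pairs at different score levels.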
An outer join is the join type you'll want to use in your case to spot customers that are part of only one of the tables.
Best regards,
Marius
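
As a rough illustration of the outer-join idea in plain Python (not the RapidMiner Join operator), the sketch below assumes a usable key can be derived from the data. Here the key is the digits of a hypothetical phone field, and it is assumed to be unique within each table. Rows that appear in only one table come back with None on the other side.

```python
def phone_key(record):
    """Hypothetical join key: the digits of the phone number."""
    return "".join(ch for ch in record["phone"] if ch.isdigit())

def full_outer_join(table_a, table_b, key):
    """Return (key, row_from_a_or_None, row_from_b_or_None) for every key.

    Assumes the key is unique within each table; later duplicates
    would overwrite earlier ones in the lookup dictionaries.
    """
    index_a = {key(r): r for r in table_a}
    index_b = {key(r): r for r in table_b}
    all_keys = sorted(index_a.keys() | index_b.keys())
    return [(k, index_a.get(k), index_b.get(k)) for k in all_keys]

# Made-up example data.
table_a = [
    {"name": "John Smith", "phone": "555-0101"},
    {"name": "Mary Jones", "phone": "555-0199"},
]
table_b = [
    {"name": "Mary Jones", "phone": "(555) 0199"},
    {"name": "Ann Lee", "phone": "555-0150"},
]

rows = full_outer_join(table_a, table_b, phone_key)
only_in_a = [a["name"] for _, a, b in rows if b is None]
only_in_b = [b["name"] for _, a, b in rows if a is None]
in_both = [a["name"] for _, a, b in rows if a is not None and b is not None]
print(only_in_a, only_in_b, in_both)
```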