How to compare each row with all other rows?

HaMu299 Member Posts: 1 Learner I
Hello everyone

I have a very large ExampleSet (more than 100,000 rows) with two attributes: id and a long string. I want to find duplicates by comparing each row with all other rows on the string attribute.
The match does not have to be exact for a row to count as a duplicate.

My idea is to use Cartesian Product to build a new ExampleSet with the attributes (id1, string1, id2, string2) and then generate a new attribute for the similarity. My problem is that the Cartesian Product operator cannot handle this much data: it displays an error saying the number of rows is limited.

Is there an alternative to using a Cartesian product? Also, what is the best way to measure the similarity between two texts?

Thank you 

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,
    For exact duplicates, you can just do an inner join with a key on every column; only the duplicates remain.
    Otherwise, it is likely either Cross Distances or some form of fuzzy matching. It depends a bit on how you define duplicates.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
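
For readers who want to see the two cases side by side, here is a minimal Python sketch. It is not a RapidMiner process, only an illustration of the idea. The function names `exact_duplicates` and `similarity` are hypothetical, rows are assumed to be `(id, string)` pairs, and `difflib.SequenceMatcher` from the standard library stands in for whatever fuzzy measure you end up choosing.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def exact_duplicates(rows):
    """Group ids by identical string and keep strings seen more than once.

    `rows` is assumed to be an iterable of (id, string) pairs.
    """
    ids_by_text = defaultdict(list)
    for row_id, text in rows:
        ids_by_text[text].append(row_id)
    return {text: ids for text, ids in ids_by_text.items() if len(ids) > 1}


def similarity(a, b):
    """Return a score between 0 and 1; 1.0 means the strings are identical.

    SequenceMatcher is one possible measure, chosen here only because it
    ships with Python. Other measures may suit long texts better.
    """
    return SequenceMatcher(None, a, b).ratio()
```

Grouping on the string needs only one pass over the data, so exact duplicates never require a row-by-row comparison. The fuzzy score is the part that forces pairwise work, which is why the definition of "duplicate" matters so much.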
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @HaMu299,

    You could use Loop Batches on the second table and select a small batch size, such as 10. Inside the loop, use Cartesian Product on the current batch and the entire large table (you could use Remember/Recall to get it efficiently) and Generate Attributes to apply your match formula. Then use Filter Examples to keep only the matches.

    Loop Batches doesn't have an output, so you could store the current results in a database, write them to CSV files using a counter that you increment in the loop, or use Recall + Append + Remember to accumulate the results inside the RapidMiner process.

    Regards,
    Balázs
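
The batching idea above can be sketched in plain Python to show the control flow. This is an illustration under assumptions, not the RapidMiner process itself: `batches` and `find_matches` are hypothetical names, rows are assumed to be `(id, string)` pairs with unique, comparable ids, the threshold of 0.9 is arbitrary, and `difflib.SequenceMatcher` is a stand-in for your own match formula.

```python
from difflib import SequenceMatcher
from itertools import islice


def batches(rows, size):
    """Yield consecutive lists of at most `size` rows (the Loop Batches step)."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def find_matches(rows, threshold=0.9, batch_size=10):
    """Compare each small batch against the full table and collect matches.

    Assumes ids are unique and comparable, so `id1 < id2` skips
    self-pairs and avoids reporting the same pair twice.
    """
    matches = []  # plays the role of Recall + Append + Remember
    for batch in batches(rows, batch_size):
        for id1, s1 in batch:          # batch x full table = small Cartesian product
            for id2, s2 in rows:
                if id1 < id2:
                    score = SequenceMatcher(None, s1, s2).ratio()  # match formula
                    if score >= threshold:                         # Filter Examples
                        matches.append((id1, id2))
    return matches


rows = [(1, "hello world"), (2, "hello world!"), (3, "something else entirely")]
print(find_matches(rows, threshold=0.9, batch_size=2))  # [(1, 2)]
```

Batching only limits how many pairs exist at one time; it does not reduce the total number of comparisons. For 100,000 rows that is still roughly 5 billion pairs, so expect a long run time unless the match formula is cheap.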