How to compare each row with all other rows?
Hello everyone
I have a very large ExampleSet with more than 100,000 rows and two attributes: an id and a long string. I want to find duplicates by comparing each row with all other rows on the string attribute.
The match does not have to be exact for two rows to count as duplicates.
My idea is to use Cartesian Product to build a new ExampleSet with the attributes (id1, string1, id2, string2) and then generate a new attribute for the similarity. The problem is that the Cartesian Product operator cannot handle this much data: it displays an error saying the number of rows is limited.
Is there an alternative to using a Cartesian product? Also, what is the best way to measure the similarity between two texts?
Thank you
Answers
You could use Loop Batches on the second table and select a small batch size, such as 10. Inside the loop, use Cartesian Product on the current batch and the entire large table (Remember/Recall gives you efficient access to it), then Generate Attributes to apply your match formula, and finally Filter Examples to keep only the matches.
Loop Batches doesn't have an output port, so you need to store the results yourself: in a database, in CSV files named with a counter that you increment in the loop, or with Recall + Append + Remember to keep the results inside the RapidMiner process.
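To make the batching idea concrete, here is a minimal Python sketch of the same logic outside RapidMiner. It is only an illustration, not a RapidMiner process: the function names, the threshold of 0.9, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions, and the sample rows are invented. It also assumes the ids can be ordered, so that each pair is compared only once.

```python
from difflib import SequenceMatcher
from itertools import islice


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def find_near_duplicates(rows, batch_size=10, threshold=0.9):
    """rows: list of (id, text) tuples. Returns (id1, id2, score) matches."""
    matches = []  # plays the role of the Recall + Append + Remember store
    for batch in batches(rows, batch_size):
        # Cartesian product of the small batch with the full table
        for id1, text1 in batch:
            for id2, text2 in rows:
                if id1 >= id2:  # skip self-pairs and mirrored pairs
                    continue
                score = similarity(text1, text2)
                if score >= threshold:  # the Filter Examples step
                    matches.append((id1, id2, score))
    return matches


# Invented sample data, for illustration only
rows = [
    (1, "the quick brown fox jumps over the lazy dog"),
    (2, "the quick brown fox jumped over the lazy dog"),
    (3, "an entirely different sentence about databases"),
]
result = find_near_duplicates(rows, batch_size=2, threshold=0.9)
print(result)
```

Only one batch-sized slice of the pair table exists at any moment, which is what keeps memory bounded. The number of comparisons is unchanged, though: 100,000 rows still give roughly 5 billion pairs, so the run time depends heavily on how cheap the similarity function is.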
Regards,
Balázs