I have two Excel files and want to find out whether row 1 from the first file is present in the second file or not.
I have two Excel files. I want to find out whether row 1 from the first file is present in the second file. If it does not match completely, I would like to know by what percentage it matches, something like fuzzy matching in Python.
The data is text: the outlet name can be the name of any outlet, and the address can be any address with city, state and zip. How can we see whether that row is present in the other Excel file? And if it is not a complete match, by what percentage does it match?
Best Answer
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi @mschmitz,
The Join operator in the Database Envy extension supports inner joins on any criterion that can be expressed between two values (column A from example set 1 and column B from example set 2), using an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e.g. Math.abs(a - b) < 1.
In this case I would go with a similarity matrix, built on character n-grams if there is reason to assume that the words are not always spelled the same way.
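To illustrate the character n-gram idea outside RapidMiner, here is a minimal Python sketch. It is not what any RapidMiner operator does internally. It assumes each row has been concatenated into one string, uses trigrams, and scores pairs with Jaccard similarity. The function names and the sample rows are made up for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        # Strings shorter than n are treated as a single gram.
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_matrix(rows_a, rows_b, n=3):
    """One row per entry of rows_a, one column per entry of rows_b."""
    return [[ngram_similarity(a, b, n) for b in rows_b] for a in rows_a]


# Hypothetical sample data: one row from the first file, two from the second.
first = ["Joe's Coffee, 12 Main Street, Springfield, IL 62701"]
second = [
    "Joes Coffee, 12 Main St., Springfield, IL 62701",
    "Blue Bottle, 300 Webster St, Oakland, CA 94607",
]

for row in similarity_matrix(first, second):
    print([f"{100 * score:.0f}%" for score in row])
```

Small spelling differences only change a few n-grams, so near-duplicates still score high, while unrelated rows share almost none.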
Regards,
Balázs
Answers
Dortmund, Germany
How about the similarity operators? You can try "Cross Distances" on two separate data sets, or "Data to Similarity" if you append the two tables together. For strings like address/city/state/outlet name or zip code, you may want to normalize all upper case letters to lower case and then apply "Nominal to Numerical" before calculating the cross distance between the examples. Of course, you can also use the nominal measures directly, but the numerical measures give you a wider choice of formulas for quantifying the difference.
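As a rough illustration of the normalize-then-compare idea, here is a small Python sketch. It does not reproduce the Cross Distances operator. It assumes both tables have the same attributes in the same order and uses a simple nominal measure, the fraction of attributes that differ after normalization. All names and sample rows are hypothetical.

```python
def normalize(value):
    """Lowercase the value and collapse repeated whitespace."""
    return " ".join(str(value).lower().split())


def nominal_distance(row_a, row_b):
    """Fraction of attributes whose normalized values differ (0.0 means identical rows)."""
    mismatches = sum(normalize(x) != normalize(y) for x, y in zip(row_a, row_b))
    return mismatches / len(row_a)


def cross_distances(table_a, table_b):
    """Distance from every row of table_a to every row of table_b."""
    return [[nominal_distance(a, b) for b in table_b] for a in table_a]


# Hypothetical rows: (outlet name, address, city, state, zip)
request = [("Joe's Coffee", "12 Main Street", "Springfield", "IL", "62701")]
reference = [
    ("JOE'S COFFEE", "12 Main  Street", "springfield", "il", "62701"),
    ("Joe's Coffee", "14 Main Street", "Springfield", "IL", "62701"),
]

for row in cross_distances(request, reference):
    print(row)
```

A measure like this only sees whole attributes as equal or different, which is why a string-level measure is usually the better fit when the spelling inside a field varies.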
operator doc
https://docs.rapidminer.com/latest/studio/operators/modeling/similarities/cross_distances.html
previous discussions/knowledge base
https://community.rapidminer.com/discussion/53879/cross-distance-how-is-it-calculated
https://community.rapidminer.com/discussion/52140/two-documents-similarity-using-cross-distance
HTH!
YY
I think we've got three options here:
1. Use Process Documents to get TF/IDF vectors and Cross Distances to get the cosine similarity, then filter and join the original data.
2. Cross-join the data and use Generate Levenshtein Distance.
3. Use the Database Envy extension to do a non-equal join. (I am not sure if this supports complex joins on fuzzy stuff.) @BalazsBarany can you give some feedback on this?
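For readers who want to see what option 2 amounts to, here is a minimal Python sketch of a cross join followed by a Levenshtein distance. It is not the RapidMiner operator itself. The conversion to a percentage (1 minus distance divided by the longer string's length) is one common convention and an assumption here, as are the function names and sample strings.

```python
from itertools import product


def levenshtein(a, b):
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def match_percent(a, b):
    """Similarity in percent, relative to the longer of the two strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def best_matches(rows_a, rows_b):
    """Cross-join both lists and keep, for each row of rows_a, its closest row in rows_b."""
    best = {}
    for a, b in product(rows_a, rows_b):
        score = match_percent(a, b)
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return best


# Hypothetical sample data.
first = ["Joe's Coffee 12 Main Street Springfield IL 62701"]
second = [
    "Joes Coffee 12 Main Stret Springfield IL 62701",
    "Blue Bottle 300 Webster St Oakland CA 94607",
]

for row, (match, score) in best_matches(first, second).items():
    print(f"{row!r} -> {match!r} ({score:.1f}%)")
```

The cross join compares every row with every other row, so the cost grows with the product of the two table sizes. That is fine for a few thousand rows but worth keeping in mind for larger files.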
Since you are also a customer, I would propose a quick call to walk you through it. Which times work better for you, European or East Coast working hours?
Best,
Martin
Dortmund, Germany