"Aggregate attributes despite spelling errors"
New to data mining and RapidMiner, so any help is appreciated. I have a database with a column {Company Name}. I need a total company count, but this attribute contains spelling errors and inconsistent spellings, so simply removing duplicates doesn't work. I have around 15K results, but I'm guessing there are really only about 800 actual companies in my database. I'm trying to avoid removing the duplicates manually in the CSV.
Example:
ABC Company
ABC Co.
ABC Company Inc.
ABC Company Inc
ABC Company, Inc
ABC
I'd want the above to be grouped into one group since it's all the same company. I've only spent a few hours in RapidMiner but figured I'd ask if this is even possible before spending more time. Can I make a process that is smart enough to automatically aggregate or group the attribute values so I have an idea of the total number of companies? It doesn't have to be 100% accurate.
Answers
Hi Matt,
the key question is: is there a pattern to exploit? Something like a list of words that should be removed (Inc, Corp, ...)? If yes, you can do this with RapidMiner. You could even compute a cross distance on the character n-grams to find spelling mistakes.
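To illustrate the character n-gram idea outside RapidMiner, here is a minimal Python sketch. It is not a RapidMiner process; the function names `char_ngrams` and `ngram_similarity`, the trigram size, and the choice of Jaccard similarity are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets (1.0 = same set, 0.0 = no overlap)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A transposed-letter typo should score much higher than an unrelated name.
typo_score = ngram_similarity("Johnson and Sons", "Jonhson and Sons")
unrelated_score = ngram_similarity("Johnson and Sons", "ABC Company")
print(typo_score, unrelated_score)
```

A typo only disturbs the few n-grams that overlap it, so misspelled names keep most of their n-grams in common, while unrelated names share few or none.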
~martin
Dortmund, Germany
Hi Martin,
Yes, removing words like Co and Inc would help, and I can do that in Excel, but then I'm still left with spelling variations.
For example:
Johnson and Son
Johnson and Sons
Johnson and Sons Construction
Could examining the first two words in a row be a pattern to exploit?
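One way to try the first-two-words idea quickly is sketched below in Python. The helper names `first_words_key` and `group_by_first_words` are hypothetical, and the sample list just reuses names from this thread.

```python
from collections import defaultdict


def first_words_key(name, k=2):
    """Build a grouping key from the first k words, ignoring case, commas and periods."""
    tokens = name.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(tokens[:k])


def group_by_first_words(names, k=2):
    """Map each key to the list of original names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[first_words_key(name, k)].append(name)
    return dict(groups)


names = [
    "Johnson and Son",
    "Johnson and Sons",
    "Johnson and Sons Construction",
    "ABC Company",
    "ABC Co.",
]
groups = group_by_first_words(names)
for key, members in groups.items():
    print(key, "->", members)
```

The three Johnson variants share one key, but "ABC Company" and "ABC Co." still end up in separate groups, and two unrelated companies that both start with "Johnson and" would be merged. So this works best as a rough first pass, probably combined with suffix removal.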
Hi Matt,
My team has developed an extension to perform text analytics in RapidMiner - the Rosette Text Toolkit, which runs on our Rosette API. Our company has a lot of experience with name ambiguity (or name matching), and our extension has an operator called Match Names that produces a score that represents the likelihood that two names are the 'same'. It is designed to work cross-lingually (like when the name of a person in Russian is written in English, the way it's written in the English alphabet could vary a lot), but it's decent at English-English too.
So, as an example, if you sign up for our free trial and compare 'Johnson and Son' with 'Johnson and Sons Construction' using our name-similarity endpoint, the Rosette API returns a score of ~0.79, which you can read as roughly a 79% chance that the two names are the same. Comparing just 'Johnson' with 'Johnson and Sons Construction' returns ~0.65, which you could say is not enough confidence to treat them as the same. In the same way, our RapidMiner Match Names operator takes two names in the same row of the two specified input attributes and returns the similarity score.
We're starting to build a RapidMiner operator that really "de-duplicates" lists of names (versus just shows you a score) -- and we'd love some feedback if you're willing. Pairwise matching is a tricky computational challenge -- which is why my company has larger on-premise solutions for "indexing" names (like building custom databases of unique entities for enterprise customers), which is currently outside of RapidMiner.
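To show why going from pairwise scores to a de-duplicated list gets expensive, here is a rough Python sketch. It does not use the Rosette API: the scorer is `difflib.SequenceMatcher` from the standard library as a stand-in, and the 0.8 threshold and the function names are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; NOT the Rosette name-matching score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.8):
    """Merge names whose pairwise score meets the threshold (union-find)."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair is compared once: n * (n - 1) / 2 comparisons.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


clusters = cluster_names(["Johnson and Son", "Johnson and Sons", "Acme Widgets"])
print(len(clusters), clusters)
```

With about 15K names this loop makes roughly 112 million comparisons, which is why larger systems index or block names first instead of scoring every pair.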
Please let me know if you need help or are interested in giving feedback for a name de-duplicating RapidMiner operator.
Thanks,
Lauren
I think you can do it with the Replace operator, if that is what you were asking.
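As a rough analogue of a regex-based replace, the following Python sketch strips common company suffixes and punctuation before counting distinct names. It is not the RapidMiner operator itself, and the suffix list and the `normalize` helper are assumptions you would need to adapt to your data.

```python
import re

# Hypothetical suffix list; extend it to match the data at hand.
SUFFIXES = re.compile(r"\b(company|co|inc|corp|ltd|llc)\b", re.IGNORECASE)


def normalize(name):
    """Drop known suffix words and punctuation, then lowercase and collapse spaces."""
    name = SUFFIXES.sub(" ", name)
    name = re.sub(r"[^\w\s]", " ", name)
    return " ".join(name.lower().split())


names = [
    "ABC Company",
    "ABC Co.",
    "ABC Company Inc.",
    "ABC Company Inc",
    "ABC Company, Inc",
    "ABC",
]
unique = {normalize(n) for n in names}
print(len(unique), unique)
```

All six variants from the original question collapse to a single key here. Real spelling variations such as "Son" versus "Sons" are not handled by this step and would still need a similarity-based pass.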