Deduplicate names in RapidMiner with Rosette

jeannejeanne Member Posts: 4 Contributor I
edited July 2020 in Knowledge Base

New Rosette API endpoint and RapidMiner operator for data cleansing

Recognizing and reconciling duplicate records is a common headache of database management, especially when the differences are subtle and likely to be missed by most deduplication systems. If your duplicates differ by misspellings, nicknames, initials, or non-Latin scripts, you may be missing connections and keeping your agents and team members from the information they need.

 

Rosette API launched a new /dedupe endpoint that uses our industry-leading fuzzy name matching to connect database records that contain moderate, or “fuzzy,” variations. Unlike deduplicators that can only pick out exact matches, Rosette lets you find and reconcile similar records for cleaner databases. To make this functionality more accessible, we simultaneously released a “Deduplicate Names” operator for RapidMiner Studio, which uses the new endpoint under the hood.

 

The Rosette Deduplicate Names operator identifies candidate duplicates in a list of names by assigning “group ids” to groups of matching names. The operator can process lists of up to 10,000 English names and assigns group ids based on a user-specified match threshold. The threshold sets the minimum similarity score required for two names to be considered duplicates. To set it, click on the operator and enter a value between 0 and 1 in the “Threshold” field. We recommend starting with a threshold of 0.8 and experimenting with higher or lower values depending on your use case and results.
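
To make the role of the threshold concrete, here is a minimal Python sketch of threshold-based grouping. It is not Rosette's algorithm: the similarity function is a stand-in built on the standard library's `difflib`, and the choice to link names transitively (if A matches B and B matches C, all three share a group) is an assumption made for illustration. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT Rosette's fuzzy name matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_group_ids(names: list[str], threshold: float = 0.8) -> list[int]:
    """Give names the same group id when pairs scoring >= threshold link them."""
    parent = list(range(len(names)))  # union-find forest, one node per name

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of any pair above the threshold.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel group roots as consecutive integers in order of first appearance.
    labels: dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(names))]
```

Calling `assign_group_ids(["John Smith", "Jon Smith", "Acme Corp"], threshold=0.8)` puts the first two names in one group and the third in its own. Raising the threshold toward 1 splits near-matches apart, and lowering it merges more names, which is the trade-off to explore when tuning the operator.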

 

Given a list of names as input, the operator outputs an integer cluster ID (the “group id” described above) for each name, in no particular order. Sorting the output by cluster ID then groups possible duplicate names together. For example:
[Screenshots: example input and output of the operator]
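
Outside RapidMiner, the sort-and-group step could look like the Python sketch below. The helper name `group_by_cluster` and the sample names and IDs in the usage note are hypothetical, not real Rosette output.

```python
from itertools import groupby
from operator import itemgetter


def group_by_cluster(names, cluster_ids):
    """Return {cluster_id: [names, ...]} from parallel lists of names and integer IDs."""
    # Sort by cluster ID; the sort is stable, so input order is kept within a cluster.
    rows = sorted(zip(cluster_ids, names), key=itemgetter(0))
    return {
        cid: [name for _, name in group]
        for cid, group in groupby(rows, key=itemgetter(0))
    }
```

With names `["Jon Smith", "Acme Corp", "John Smith"]` and cluster IDs `[7, 3, 7]`, this returns `{3: ["Acme Corp"], 7: ["Jon Smith", "John Smith"]}`. Any cluster with more than one name is a set of candidate duplicates to review.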

Further refine your results with additional fields

When you submit a name-deduplication request in RapidMiner, you need only input a list of names. If you know the entity type, you can also set it to person (default), location, or organization to improve accuracy.

The Rosette API endpoint itself also supports additional language and script fields, beyond those offered in RapidMiner, to further improve your results.
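
As a rough illustration of how these optional fields might fit together, the Python sketch below builds a request body without sending it anywhere. The JSON field names (`names`, `text`, `entityType`, `language`, `script`, `threshold`) and the helper `build_dedupe_request` are assumptions made for this example, so check them against the Rosette API documentation before relying on them.

```python
import json

# Entity types named in this post; the upper-case spelling is an assumption.
ENTITY_TYPES = {"PERSON", "LOCATION", "ORGANIZATION"}


def build_dedupe_request(names, threshold=0.8, entity_type="PERSON",
                         language=None, script=None):
    """Build a hypothetical request body for a name-deduplication call."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type}")

    entries = []
    for text in names:
        entry = {"text": text, "entityType": entity_type}
        # Optional hints; only included when the caller supplies them.
        if language is not None:
            entry["language"] = language
        if script is not None:
            entry["script"] = script
        entries.append(entry)
    return {"names": entries, "threshold": threshold}


if __name__ == "__main__":
    body = build_dedupe_request(["John Smith", "Jon Smith"], entity_type="PERSON")
    print(json.dumps(body, indent=2))
```

Leaving `language` and `script` unset keeps the request equivalent to the names-only case described above. The accepted code values for those two fields are not specified in this post, so none are shown here.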

 


Try it yourself

Ready to start deduplicating the names in your data? First, sign up for a free Rosette API key (up to 10,000 calls/month), then head over to RapidMiner.

 

If you need to process large volumes of records or would prefer not to send your data to the cloud, talk to our sales team about custom solutions and on-premises deployments.