Arrange list of names by similarity?
Hi All,
I am a complete novice with RapidMiner and, despite watching multiple videos and trawling the forum, I am unable to get my head around how to solve what I think is a very simple problem!
I have a list of names (approx. 5k), and all I want to achieve is to sort this list of names by similarity.
All I have process-wise so far is:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Local Repository/email test"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="name_recipients"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="514" y="136"/>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I would be most grateful for anyone's assistance.
Kind Regards
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi again @DAVID_EALES,
Interesting but difficult task.....
I found a resource in the community which seems interesting for your project.
To sum up, you can use the Deduplicate Names operator of the Rosette Text Analytics extension.
This extension must be installed from the Marketplace. Moreover, you must obtain an API key to use it.
Tested like this with your (very partial) example set:
This process gives the following result:
I hope it will be useful.
Regards,
Lionel
Answers
Hi @DAVID_EALES,
Here is a process that computes and sorts the distances between the names in a list, using the Data to Similarity operator:
I don't know your dataset or exactly what you want to do, but for nominal attributes (the names, in your case) the distance will always be either 0 (when the two names match exactly, in other words they are the same) or 1 (in all other cases). So your table will be filled only with "1" and "0".
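To illustrate the point outside RapidMiner, here is a minimal Python sketch of an exact-match (nominal) distance. It is only an illustration of the 0/1 behaviour described above, not the code of the Data to Similarity operator; the function name and the sample addresses are assumptions.

```python
def nominal_distance(a: str, b: str) -> int:
    """Distance between two nominal values: 0 if identical, 1 otherwise."""
    return 0 if a == b else 1


# Hypothetical sample: the first and third entries are identical.
names = [
    "joe.bloggs@domain.com",
    "soe.blogs@domain.com",
    "joe.bloggs@domain.com",
]

# Pairwise distance matrix: near-duplicates such as "soe.blogs" still get 1.
matrix = [[nominal_distance(a, b) for b in names] for a in names]

for row in matrix:
    print(row)
```

Because near-duplicates and completely different names both get a distance of 1, such a matrix carries no information for ranking names by similarity.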
Regards,
Lionel
In the free Operator Toolbox extension, there is an operator to Generate Levenshtein Distance, which I think is more in line with what you want to do. But I am not sure exactly what you mean by sorting the list, because to do that you would first have to select one name as the reference name against which the similarity of all other names would be calculated.
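For readers unfamiliar with it, the idea can be sketched in plain Python. This is a textbook edit-distance implementation, not the Operator Toolbox operator, and the reference name and sample list are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical reference name and list, for illustration only.
reference = "joe.bloggs@domain.com"
names = [
    "k@domain.com",
    "soe.blogs@domain.com",
    "joe.bloggs@domain.com",
    "1joe.bloggs@domain.com",
]

# Sort every name by its distance to the chosen reference name.
ranked = sorted(names, key=lambda name: levenshtein(reference, name))
print(ranked)
```

The sort only makes sense relative to the chosen reference; picking a different reference name gives a different order.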
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thanks to all for your replies thus far
To explain further, I want to group (cluster?) email addresses by similarity rather than alphabetically. For example:
Alphabetical sort....
What I am trying to achieve....
a.user@domain.com
another.person@domain.com
1joe.bloggs@domain.com
joe.bloggs@domain.com
soe.blogs@domain.com
k@domain.com
I understand about the distance measurement, but how do I take that distance measurement and use it to rearrange the output?
Hope the above makes sense.
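One possible way to turn a similarity measure into a rearranged list is a simple greedy grouping: each address joins the first group whose first member is similar enough, otherwise it starts a new group. The Python sketch below is only an illustration under assumptions: it uses `difflib.SequenceMatcher` from the standard library rather than any RapidMiner operator, compares only the part before the `@`, and the 0.7 threshold is an arbitrary choice that would need tuning on real data.

```python
from difflib import SequenceMatcher


def local_part(address: str) -> str:
    """Part of the address before the '@' (the shared domain adds no signal)."""
    return address.split("@")[0]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the local parts of two addresses."""
    return SequenceMatcher(None, local_part(a), local_part(b)).ratio()


def group_by_similarity(addresses, threshold=0.7):
    """Greedy grouping: an address joins the first group whose first member
    is at least `threshold` similar, otherwise it starts a new group."""
    groups = []
    for address in addresses:
        for group in groups:
            if similarity(address, group[0]) >= threshold:
                group.append(address)
                break
        else:
            groups.append([address])
    return groups


addresses = sorted([
    "a.user@domain.com",
    "another.person@domain.com",
    "1joe.bloggs@domain.com",
    "joe.bloggs@domain.com",
    "soe.blogs@domain.com",
    "k@domain.com",
])

groups = group_by_similarity(addresses)
# Flatten the groups so that similar addresses end up next to each other.
rearranged = [address for group in groups for address in group]
print("\n".join(rearranged))
```

The result depends on the input order and on the threshold, and the number of comparisons grows with the number of groups, so for a few thousand addresses a proper clustering operator may be a better fit.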
Kind Regards
Many Thanks Lionel, your idea worked.
Kind Regards
Ok, so the solution proposed by Lionel worked during testing, but I am unable to get it to run through the entire list as I am getting Error 504.
I have split the data into batches of 1000 rows and it all processes fine but I need it to be able to process the entire list of 5k entries at once.
Is this some sort of timeout error? I have looked at the Rosette documentation and I can't find any mention of it.
Kind Regards
Hi @DAVID_EALES,
According to your last message, it works for datasets of up to 1K rows --> OK
But normally it should work gracefully with datasets of up to 10k rows (see the description in the RapidMiner documentation).
I have contacted Rosette support to find out what is going on with this error (Error 504). Maybe the limit has been changed...
Regards,
Lionel
Hi @DAVID_EALES,
It seems that your hypothesis is the right one.
Rosette is working on a fix for the next release. Here is Rosette's answer:
"Lionel,
We were able to trace this to an internal issue where our Name Deduplicate endpoint is timing out on large calls. Our suggestion would be to break the calls up to smaller chunks. We have an open an internal L3 Issue to correct this in a future release. Also for future reference here is a link to our error codes.
https://developer.rosette.com/features-and-functions#errors
I will hold this ticket open and will provide you a follow on update once we release a complete fix for this issue.
Best Regards,"
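The suggested workaround of breaking the calls into smaller chunks can be sketched generically in Python. This does not use the Rosette API: `call` is a hypothetical placeholder for whatever function sends one batch to the service, and the batch size of 1000 is taken from the size reported as working earlier in this thread.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(rows: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def process_in_batches(rows: Sequence,
                       call: Callable[[Sequence], List],
                       size: int = 1000) -> List:
    """Run `call` (hypothetical: one request per batch) on each chunk
    and concatenate the per-batch results."""
    results: List = []
    for batch in chunks(rows, size):
        results.extend(call(batch))
    return results


# Stand-in for the real request: it just reports the size of each batch.
batch_sizes = process_in_batches(list(range(5000)), lambda batch: [len(batch)])
print(batch_sizes)
```

One caveat with this approach: if each batch is processed independently, similar names that fall into different batches are presumably never compared with each other, so the per-batch results may still need to be reconciled afterwards.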
Regards,
Lionel
Thank You Lionel, much appreciated.