Text Mining - Documents Similarity (words position)
silviabastos
Member Posts: 2 Learner III
Hello,
I'm looking for a way to measure the similarity between documents where the position of the words is relevant.
I've already implemented the sample with the "Data to Similarity" operator (cosine similarity), like here:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-compare-similarity-of-large-number-of-documents/td-p/16002
But I need to take into account the order/position of the words, not only their frequency or occurrence.
E.g.:
Example 1: A B C D E F G
Example 2: A X B D Y F G
Example 3: G F E A B C D
Example 1 and Example 2 should be more similar than Example 1 and Example 3: although Example 3 contains exactly the same words as Example 1 (cosine similarity = 1), they are in different positions. Example 2 has only two different words (X, Y), plus one word in a different but nearby position...
I think this problem is difficult to explain, and I'm not sure whether RapidMiner can give me a solution.
Best regards,
Silvia
Answers
Instead of tokenizing your documents, you may want to take a look at "Data to Similarity", which allows the computation of various types of nominal distances between entities. I am not familiar with all the details of those distance metrics (Dice, Jaccard, Tanimoto, etc.), but it is possible that one or more of them is suitable for your purposes.
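As a minimal plain-Python sketch (not RapidMiner code) of what one of those metrics computes: Jaccard similarity compares the sets of words that occur in two documents, which is why it cannot distinguish a reordered document from the original on its own.

```python
# Jaccard similarity on word sets: set-based measures compare which
# words occur, so word order plays no role at all.
def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

doc1 = "A B C D E F G"
doc3 = "G F E A B C D"   # same words, different order

print(jaccard(doc1, doc3))  # 1.0: identical word sets, order ignored
```

So for the order-sensitive part of the question, a set-based metric would need to be combined with one of the sequence-aware ideas discussed below in the thread.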
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @silviabastos
This is a great question. To "remember" the location of the key words, you can use "Generate n-Grams" for phrase search with a maximum term length of 7; of course, it will need more time for text processing.
Suppose you do not have many words in each document, ideally just like the examples shown in your message, so that we have three documents as simple as the ones in your post.
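As a rough plain-Python sketch of the n-gram idea (not the RapidMiner operator itself): an n-gram records which words appear next to each other, so overlap between n-gram sets rewards documents that preserve local word order.

```python
# Word bigrams encode which word follows which, so comparing bigram
# sets makes the similarity sensitive to local word order.
def ngrams(text, n=2):
    """Set of word n-grams of a whitespace-tokenized text."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_jaccard(a, b, n=2):
    """Jaccard overlap between the n-gram sets of two texts."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(ngram_jaccard("A B C D E F G", "A X B D Y F G"))  # shares only (F, G)
print(ngram_jaccard("A B C D E F G", "G F E A B C D"))  # shares the run A B C D
```

Note that a reordered document can still score well if it keeps long runs of the original (Example 3 keeps the run A B C D), so the n-gram length needs tuning for the data at hand.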
You can also use the Levenshtein distance offered in Dr. Martin Schmitz's Operator Toolbox extension: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox
The Levenshtein distance is calculated as the number of changes needed to convert one string into the other. A common use case for this distance is spell checking.
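As a minimal sketch of the idea (the Operator Toolbox works on strings; treating each word as one symbol applies the same edit-distance idea to word order), here is a token-level Levenshtein distance in plain Python:

```python
# Token-level Levenshtein distance: the number of single-word insertions,
# deletions and substitutions needed to turn one word sequence into another.
def levenshtein(a, b):
    """Edit distance between two token sequences, via the standard DP."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute x -> y
        prev = cur
    return prev[-1]

doc1 = "A B C D E F G".split()
doc2 = "A X B D Y F G".split()
doc3 = "G F E A B C D".split()

print(levenshtein(doc1, doc2))  # 3 edits
print(levenshtein(doc1, doc3))  # 6 edits: reordering costs more
```

On the examples from the first post this gives the ordering Silvia asked for: doc2 (3 edits) is closer to doc1 than doc3 (6 edits), even though doc3 contains exactly the same words.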
Here is the xml of my process. HTH!
YY
Hi!
I will try both options.
Regarding @yyhuang's solution: I only wrote a small example in the first post; the texts I'm working with are natural language, about 900 words each, so I'm not sure whether I can use it.
Regarding @Telcontar120's solution: I made a first attempt, but I didn't get consistent results.
I will work a little more on this and post the problems I find.
Any other solutions are welcome.
Thank you.
Hi @silviabastos
Thanks for the follow-up! Maybe you can try word2vec for documents with 900+ words?
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Trained on a single corpus, the word2vec algorithm generates one multidimensional vector for each word. These vectors are known to carry semantic meaning that helps you understand the context in which each word appears.
You can install the word2vec extension from the Marketplace.
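Word2vec itself needs to be trained on a corpus, but as a toy sketch of how the resulting vectors are typically used (the 2-d vectors below are made up; real ones have hundreds of dimensions), one simple document similarity is the cosine between averaged word vectors:

```python
import math

# Hypothetical 2-d "word vectors" standing in for a trained word2vec model.
vectors = {
    "dog":   [0.90, 0.10],
    "puppy": [0.85, 0.20],
    "car":   [0.10, 0.90],
}

def doc_vector(words):
    """Average the word vectors of a document (note: averaging loses order)."""
    dims = len(next(iter(vectors.values())))
    avg = [0.0] * dims
    for w in words:
        for k, v in enumerate(vectors[w]):
            avg[k] += v / len(words)
    return avg

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(doc_vector(["dog", "puppy"]), doc_vector(["dog"])))  # high: related words
print(cosine(doc_vector(["dog"]), doc_vector(["car"])))           # low: unrelated words
```

One caveat for this thread: plain averaging discards word order, so for position sensitivity the vectors would have to be combined with a sequence-aware comparison (e.g. an edit distance over per-word vectors) rather than a single averaged document vector.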
HTH!
YY