Comparing Documents with Multiple Lists of Example Data
Hello,
I would like to first process one or more documents (tokenize, n-grams, etc. -> done) and then compare each document with several lists of example data. If there is a match/similarity, the name of the respective list should be assigned to the original document. If a document contains common tokens that do not agree with any list, then "Others" should be assigned additionally. It should later be possible to trace which lists match a document. I imagine this to be similar to a sentiment analysis with a trained model, except that instead of just positive and negative there are many possible assignments. Unfortunately, I can't find an approach for how to proceed.
I would appreciate your help :)
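To make the desired mapping concrete, here is a minimal Python sketch of the intended result under one possible reading of the requirement; every document name, list name, and token below is made up:

```python
# Each document gets the name of every reference list it shares tokens with;
# if it also contains tokens not covered by any matching list, "Others" is
# added as well. All data here is hypothetical.

documents = {
    "doc_1": ["engine", "brake", "coffee"],
    "doc_2": ["keyboard", "mouse"],
}

reference_lists = {
    "List_A": ["engine", "brake"],
    "List_B": ["screen", "keyboard"],
}

assignments = {}
for doc_name, doc_tokens in documents.items():
    token_set = set(doc_tokens)
    # Every list that shares at least one token with the document is a match.
    matches = [name for name, words in reference_lists.items()
               if token_set & set(words)]
    # Tokens not covered by any matching list trigger the extra "Others" label.
    covered = set().union(*(set(reference_lists[m]) for m in matches)) if matches else set()
    if token_set - covered:
        matches.append("Others")
    assignments[doc_name] = matches

print(assignments)
# {'doc_1': ['List_A', 'Others'], 'doc_2': ['List_B', 'Others']}
```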
Answers
Hi @Nicson,
If I understand correctly what you want, here is a starting point: a process with one wordlist and one document.
I create an attribute with the value:
- "wordlistname_documentname" if all the words of the wordlist are present in the document
- "wordlistname_documentname (others)" if only some of the words of the wordlist are present in the document.
Here is the process:
I think this process can be improved, maybe with a Loop operator and/or a Select Subprocess operator, to generalize it to N documents and N wordlists.
I hope it will be helpful.
Regards,
Lionel
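A minimal Python sketch of the logic Lionel describes, outside of RapidMiner; the wordlist names, document name, and tokens are hypothetical:

```python
# Build the attribute value "wordlistname_documentname", appending "(others)"
# when only part of the wordlist occurs in the document.

def label(wordlist_name, wordlist, document_name, document_tokens):
    # How many words of the list actually occur in the document?
    present = [w for w in wordlist if w in document_tokens]
    if len(present) == len(wordlist):
        return f"{wordlist_name}_{document_name}"
    # Only part of the wordlist was found.
    return f"{wordlist_name}_{document_name} (others)"

tokens = {"engine", "brake", "wheel"}
print(label("List_A", ["engine", "brake"], "doc_1", tokens))   # List_A_doc_1
print(label("List_B", ["engine", "screen"], "doc_1", tokens))  # List_B_doc_1 (others)
```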
Hey @Nicson,
I think what you want to do is to tokenize / n-gram the reference data set and the normal data set in the same way and afterwards use a Cross Distance operator with cosine similarity to find similar items.
Best,
Martin
Dortmund, Germany
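A rough Python sketch of this approach, not the RapidMiner process itself: scikit-learn's CountVectorizer and cosine_similarity stand in for the tokenize/n-gram and Cross Distance steps, and all data is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the engine and the brake need service",
             "keyboard and mouse are not responding"]
reference_lists = {"List_A": "engine brake wheel",
                   "List_B": "screen keyboard mouse"}

# One shared vectorizer so both sides get the same tokenization and n-grams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
doc_vectors = vectorizer.fit_transform(documents)
ref_vectors = vectorizer.transform(reference_lists.values())

# Rows = documents, columns = reference lists (this mirrors a cross-distance matrix).
similarity = cosine_similarity(doc_vectors, ref_vectors)
for doc, row in zip(documents, similarity):
    best_list, best_score = max(zip(reference_lists, row), key=lambda pair: pair[1])
    print(f"{doc!r} -> {best_list} ({best_score:.2f})")
```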
Thank you for your answers.
Yes, I have a reference dataset in every list and I want to compare it with every actual document. I created a little visualization to illustrate my project.
The list "Documents" contains all documents, List_A - List_C are the reference lists, which should be checked for their similarity to the contents of the documents. It is also important that the reference data is not only single words but also word pairs (n_grams).
The second picture shows how I imagine the output of the data.
kind regards
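Since the reference lists contain word pairs as well as single words, both the documents and the lists need the same n-gram step before any comparison. A small illustration, assuming underscore-joined bigrams similar to RapidMiner's n-gram output; the example text and list are made up:

```python
def tokens_and_bigrams(text):
    words = text.lower().split()
    # Join neighbouring words with "_" so word pairs can be matched like single tokens.
    bigrams = ["_".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)

doc = tokens_and_bigrams("The hard disk makes a clicking noise")
ref_list_c = {"hard_disk", "clicking_noise", "boot"}

print(doc & ref_list_c)  # {'hard_disk', 'clicking_noise'}
```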
hello @Nicson - welcome to the community. Helpful hint from moderator: attach your csv/xls files to your posts so the kind people helping you don't have to recreate them.
Scott
@sgenzer Thanks for your advice, I'll take it into account for future postings.
@mschmitz
I have just been looking at the Cross Distance operator and its tutorial process. What this operator does is understandable to me, but I have problems applying it to my project. Assuming I have a single document that I want to compare with a word list, what should this process look like?
Hi,
Have a look at the attached process. This would be my first try. Another way could be to use the Dictionary Based Sentiment Learner and misuse it to check how many tokens of your list are in the text.
Cheers,
Martin
Dortmund, Germany
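A plain-Python sketch of that second idea, counting how many tokens of a wordlist appear in a text, independent of the Dictionary Based Sentiment Learner itself; all names and data are hypothetical:

```python
from collections import Counter

def list_coverage(document_tokens, wordlist):
    counts = Counter(document_tokens)
    # Which words of the list occur in the document, and how often?
    hits = {word: counts[word] for word in wordlist if counts[word] > 0}
    # Share of the wordlist that was found in the text.
    coverage = len(hits) / len(wordlist)
    return hits, coverage

doc_tokens = ["engine", "brake", "engine", "noise"]
hits, coverage = list_coverage(doc_tokens, ["engine", "brake", "wheel"])
print(hits)                # {'engine': 2, 'brake': 1}
print(round(coverage, 2))  # 0.67
```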