Compare 2 PDF texts
Hello,
I'm trying to create a process that compares 2 PDFs that are subtly different.
I process my documents (tokenize, filter stopwords, generate n-grams...) from two different files, merge them into one common example set with the "Append" operator, and then use the "Remove Duplicates" operator to see the differences between the PDFs. Please find attached my process. I have 2 questions:
1) Is it possible to convert my resulting example set into a wordlist, so that the table is organized by row rather than by column?
2) Something seems to have gone wrong: words that are in both files appear in the output, whereas it should show only the words that are in one document and absent from the other, and vice versa.
Thanks !
Sabine
Answers
Please find attached a screenshot of my process. The second picture shows what is contained inside the two "Process Documents from Files" operators.
When you generate the original wordlist from each PDF, you can use the "WordList to Data" operator to create example sets of the words and their counts. You could then add a source attribute (with Generate Attributes or via a macro) for each PDF, and then merge/join those two datasets. That should let you easily see which words are common to both files and which are unique to one or the other.
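To make the logic of that join concrete, here is a minimal sketch in plain Python rather than as a RapidMiner process. It assumes the text has already been extracted from each PDF; the sample strings, the function names (word_counts, compare_word_counts) and the "a_only"/"b_only"/"both" labels are hypothetical and only stand in for the word-count tables and the source attribute described above.

```python
from collections import Counter
import re


def word_counts(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def compare_word_counts(counts_a, counts_b):
    """Full outer join of two word-count tables on the word.

    Returns one row per word: (word, count_in_a, count_in_b, source),
    where source is "both", "a_only", or "b_only".
    """
    rows = []
    for word in sorted(set(counts_a) | set(counts_b)):
        in_a = counts_a.get(word, 0)
        in_b = counts_b.get(word, 0)
        if in_a and in_b:
            source = "both"
        elif in_a:
            source = "a_only"
        else:
            source = "b_only"
        rows.append((word, in_a, in_b, source))
    return rows


# Hypothetical stand-ins for the text extracted from the two PDFs.
text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The quick red fox jumps over the sleepy dog"

rows = compare_word_counts(word_counts(text_a), word_counts(text_b))

# Keep only the words that occur in exactly one of the two documents.
unique_rows = [row for row in rows if row[3] != "both"]
for row in unique_rows:
    print(row)
```

Each row holds one word with its count in each document, so the result is already laid out by row, and filtering on the source value is what isolates the differences between the two files.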
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts