Text Mining: analyse PDFs with a dictionary which has categories
Hello,
I want to analyse a number of PDFs (35) using a kind of dictionary. The output should be an Excel file showing how often each word of the dictionary appears in the PDFs. It may be important to know that the dictionary is not just a list of words: the words are classified into five categories. The analysis should therefore tell me how often each dictionary word is reported and which category is reported the most.
I have already read many questions here and watched tutorials, but I could not find exactly what I need. Trial and error has not worked so far either. I hope someone can help me.
Many thanks in advance,
Nina
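For readers who want to see the counting logic outside RapidMiner, here is a minimal Python sketch of the analysis described above. It assumes the text has already been extracted from each PDF; the category names, keywords, and file names are hypothetical placeholders, not taken from the original dictionary.

```python
import re
from collections import Counter

# Hypothetical dictionary: category -> lowercase single-word keywords.
DICTIONARY = {
    "technology": ["digital", "software"],
    "finance": ["revenue", "cost"],
}

def count_keywords(text, dictionary):
    """Return {category: {keyword: count}} for one document's text."""
    # Lowercase the text and count every alphabetic token once.
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        category: {word: tokens[word] for word in words}
        for category, words in dictionary.items()
    }

def category_totals(counts):
    """Sum the keyword counts within each category."""
    return {category: sum(words.values()) for category, words in counts.items()}

# Stand-in for text already extracted from the PDFs (hypothetical content).
documents = {
    "report_a.pdf": "Digital revenue grew while software cost fell. Digital first.",
    "report_b.pdf": "Cost control and revenue growth.",
}

for name, text in documents.items():
    counts = count_keywords(text, DICTIONARY)
    print(name, counts, category_totals(counts))
```

The per-document results could then be written out row by row, for example as a CSV file that Excel can open. Multi-word dictionary entries would need a different matching step than the single-token lookup used here.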
Answers
Dortmund, Germany
Thanks for your help,
Nina
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Yes, you're right. I have a word list of key words (which are categorized) and want to scan all my PDFs for these words. Thus I only want to see these words and their occurrences in the result view.
I tried your proposal, but I couldn't feed the word list into the input port and connect it to the Process Documents operator, as an error occurred. Furthermore, I'm not sure where to add all the PDFs that should be analysed. Are both the word list and the PDFs set as inputs for the Process Documents operator?
I hope my problem is not too confusing. Maybe it helps to have a look at the XML I posted before.
@nsmith see my comments above regarding the wordlist input. You may need to generate your wordlist first. For the PDFs, you can use Process Documents from Files and set its parameters to read the PDF files from your hard drive.
@mschmitz @Telcontar120 thank you very much for your answers, it's nearly working now!
Unfortunately, there is still one problem with the "Filter Tokens Using ExampleSet" operator. I want to filter with my word list, which contains two kinds of entries: stand-alone words and multi-word terms.
In general it works, as I used the "Generate n-Grams" operator beforehand. Thus all stand-alone words and terms I specified are in the result list. The problem is that the operator also generates terms that I did not explicitly include in the word list. An example is "accelerating_digital". Even though I did not have this term in my word list, I want to have it in my result list as it contains the word "digital" (which is in my word list).
Is there a way to solve this problem?
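To make the two possible filtering behaviours concrete, here is a small Python sketch. It is not how the RapidMiner operator is implemented; it only illustrates the difference between keeping tokens that match the word list exactly and keeping n-grams that contain a listed word. The entry "supply_chain" and the token list are hypothetical examples.

```python
def filter_exact(tokens, wordlist):
    """Keep only tokens that appear verbatim in the word list."""
    allowed = set(wordlist)
    return [t for t in tokens if t in allowed]

def filter_by_component(tokens, wordlist, sep="_"):
    """Keep tokens that are in the word list, or whose parts
    (split on the n-gram separator) include a listed word."""
    allowed = set(wordlist)
    return [
        t for t in tokens
        if t in allowed or any(part in allowed for part in t.split(sep))
    ]

# Hypothetical word list with one stand-alone word and one multi-word term.
wordlist = ["digital", "supply_chain"]
# Hypothetical tokens after n-gram generation.
tokens = ["digital", "accelerating_digital", "supply_chain", "supply", "chain_management"]

print(filter_exact(tokens, wordlist))
print(filter_by_component(tokens, wordlist))
```

With exact matching, "accelerating_digital" is dropped; with component matching, it is kept because one of its parts, "digital", is in the word list.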