Extract data from pdf files and perform text analysis
Studentul_86
Member Posts: 11 Learner I
in Help
Hello,
I'm a recent user of RapidMiner, using the free educational solution, for one academic paper I'm working on.
The problem is that so far I have not found a way to extract text from PDF files for text analysis in RapidMiner.
Can somebody advise me on a process for extracting text from multiple PDF files at once in RapidMiner, so that I can reach my goal of counting words?
Also, regarding sentiment analysis of texts, can somebody give me hints on free solutions for this in RapidMiner?
Thank you.
Best regards,
Valentin.
Best Answers
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
The Read Document operator has an option to read PDFs. You want to combine this with a Loop Files operator.
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi Vali,
I'm not sure what you are looking for, so I propose 2 options based on Martin's idea:
- Process 1 (in attached file): Read Document inside a Loop Files operator, then a Process Documents operator.
- Process 2 (in attached file): Read Document inside a Loop Files operator, then a Combine Documents operator, then a Process Documents operator.
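For readers who want to see the difference between the two options outside RapidMiner, here is a rough Python sketch. It is not the attached .rmp processes, only an analogy: `read_folder`, `count_words`, `process_1` and `process_2` are hypothetical names, and `read_text` is a placeholder you would have to supply yourself to pull the text out of a PDF.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Callable, Dict


def read_folder(folder: str, read_text: Callable[[Path], str],
                pattern: str = "*.pdf") -> Dict[str, str]:
    """Visit every matching file and read it (roughly Loop Files + Read Document)."""
    return {p.name: read_text(p) for p in sorted(Path(folder).glob(pattern))}


def count_words(text: str) -> Counter:
    """Simplified stand-in for Process Documents: lowercase, keep letter runs, count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def process_1(documents: Dict[str, str]) -> Dict[str, Counter]:
    """Option 1: one word count per document."""
    return {name: count_words(text) for name, text in documents.items()}


def process_2(documents: Dict[str, str]) -> Counter:
    """Option 2: combine all documents first, then count once."""
    return count_words("\n".join(documents.values()))
```

Option 1 keeps the documents apart, so you can compare them; option 2 gives a single word list for the whole folder.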
Tell us if one of these processes answers your request. If not, can you elaborate on what you want to achieve?
Regards,
Lionel
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi Vali,
It seems that your Loop Files operator is not set up correctly.
Please import the second process (Loop_read_pdfs_documents.rmp) I shared in my previous post and, in the parameters of the Loop Files operator, set the path where your PDF files are stored.
Regards,
Lionel
Answers
Thank you for the advice. I've tried, but as far as I can see, Loop Files is meant for chaining multiple Read Document operators. How can I import about 300 PDF files at once into a Process Documents operator?
Thank you for your support.
Best regards,
Vali.
I've managed to create the set of documents loaded into my RapidMiner process.
However, I now face a strange situation: none of the PDF files loaded into my process lead to a word list. In the attached files you can see the process I've designed, a really simple one. The results show nothing: no words, no list of analyzed documents. What did I do wrong? This process was tested with only 5 PDF files.
What I need to do with those roughly 300 PDF files is simple. I want to:
- create a list of the words and their counts across the files;
- get the length of each file (number of words);
- get the correlation between words, for some specific terms;
- get a set of graphical associations for those specific terms;
etc.
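To make the first three goals concrete, here is a minimal Python sketch. It assumes the text of each PDF has already been extracted into a dictionary mapping file name to text; the function names are illustrative only, and "correlation" is read here as the Pearson correlation of per-document term counts, which may not be the measure RapidMiner reports.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(documents: Dict[str, str]) -> Counter:
    """Word list with total counts across all documents."""
    total: Counter = Counter()
    for text in documents.values():
        total.update(tokenize(text))
    return total


def document_lengths(documents: Dict[str, str]) -> Dict[str, int]:
    """Number of words in each document."""
    return {name: len(tokenize(text)) for name, text in documents.items()}


def term_correlation(documents: Dict[str, str], term_a: str, term_b: str) -> float:
    """Pearson correlation of the per-document counts of two terms."""
    tokens = [tokenize(text) for text in documents.values()]
    xs = [t.count(term_a) for t in tokens]
    ys = [t.count(term_b) for t in tokens]
    if not xs:
        return 0.0
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a term with constant counts has no defined correlation
    return cov / math.sqrt(var_x * var_y)
```

With only a handful of documents the correlation is not very meaningful; it becomes more useful across the full set of files.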
Unfortunately I'm stuck at the very beginning of the process. I need your advice, or that of anybody else in this community.
Thank you,
Best regards,
Vali.
I've made the change you recommended. Now it tells me that Loop Files is not working properly because there are not enough iterations. What does that mean? Please help me with what I still have to change. Attached you can find the error I'm talking about.
Thank you,
Vali.
Your current flow works on a single PDF at a time, whereas you most likely need all of them combined to get decent TF-IDF results.
First make sure you get any content at all. Loop through the PDFs, combine them with the Combine Documents operator, and check the results. Start with just a few files.
Then run TF-IDF on the combined result, tuning the pruning as you go.
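A small Python sketch may help show why TF-IDF on one PDF at a time gives nothing useful. It assumes the common textbook weighting, tf × log(N / df), and a simple absolute pruning rule; RapidMiner's own weighting, normalization and prune options may differ, and `min_docs` / `max_docs` are hypothetical parameter names.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents: Dict[str, str], min_docs: int = 1,
          max_docs: Optional[int] = None) -> Dict[str, Dict[str, float]]:
    """TF-IDF per document, keeping terms found in min_docs..max_docs documents."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(counts)
    doc_freq: Counter = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())  # each document counts once per term
    upper = n_docs if max_docs is None else max_docs
    vocab = {t for t, df in doc_freq.items() if min_docs <= df <= upper}
    weights: Dict[str, Dict[str, float]] = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            t: (c[t] / total) * math.log(n_docs / doc_freq[t])
            for t in c if t in vocab
        }
    return weights
```

With a single document, N and df are both 1, so log(N / df) is 0 and every weight collapses to zero under this definition. Terms that occur in every document get zero weight for the same reason, which is what an upper pruning bound removes.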