Create Document From Specific PDF Sections
Hi Everyone,
I am trying to create a single document from a PDF file, with one row for each of the repeated sections in that file.
Example: I have a PDF document with a lot of text, but it contains one repeated section of comments that I would like to collect for each ID associated with the comments. I used the Process Documents from Files operator together with the Extract Information operator, with a start and end expression to capture the comments in between. It works for the first section where the start and end expressions are found, but it doesn't capture the rest of the sections.
Please let me know if I need to explain this any further.
Thank you
Blah Blah Blah
Blah Blah Blah
ROWID
Start Section
Comments i need
End Section
Blah Blah Blah
Blah Blah Blah
ROWID
Start section
Comments i need
End Section
The final example set would be in this form:
ROWID - Comments
ROWID - Comments
ROWID - Comments
.
.
.
Best Answer
bhupendra_patil Employee-RapidMiner, Member Posts: 168 RM Data Scientist
Not specifically your use case, but this knowledge base article does something similar: it cuts a document into pieces at full stops, "and", and "but".
http://community.rapidminer.com/t5/Text-Analytics/Splitting-text-into-sentences/ta-p/31845
You can potentially apply the same process, but with different delimiters based on your criteria.
Let us know if this helps.
Answers
Thanks for the reply. I actually figured out how to do it using the cut documents and extract information operators and it worked great!
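For anyone who wants to check the expected output outside RapidMiner, the same two-step idea (cut the text into one piece per repeated section, then pull the ID and comments out of each piece) can be sketched in Python. This is only an illustration, not the RapidMiner process itself. It assumes the PDF text has already been extracted to a string, that the ID sits on the line directly above the start marker as in the example in the question, and that the markers are literally "Start Section" and "End Section"; the function name `extract_sections` is made up for this sketch.

```python
import re

# One match per repeated section.
# Assumption: the ID is the line directly above the start marker,
# and the markers are "Start Section" / "End Section" (any letter case).
SECTION = re.compile(
    r"^(?P<row_id>[^\n]+)\n"      # line immediately before the start marker
    r"Start section[ \t]*\n"      # start marker on its own line
    r"(?P<comments>.*?)"          # everything up to the nearest end marker
    r"\n?End section",            # end marker
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def extract_sections(text):
    """Return a list of (row_id, comments) tuples, one per section found."""
    # finditer walks through every match in the text, not just the first one.
    return [
        (m.group("row_id").strip(), m.group("comments").strip())
        for m in SECTION.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "Blah Blah Blah\n"
        "ID-1\nStart Section\nFirst comment\nEnd Section\n"
        "Blah Blah Blah\n"
        "ID-2\nStart section\nSecond comment\nEnd Section\n"
    )
    for row_id, comments in extract_sections(sample):
        print(f"{row_id} - {comments}")
```

The key point is iterating over all matches instead of stopping at the first one, which mirrors cutting the document into segments before extracting from each segment.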
On a different subject, I was wondering if you could help me, or point me to someone who can, with a post I submitted a while back that never got a reply: http://community.rapidminer.com/t5/RapidMiner-Studio/Emphasize-certain-tokens-for-classification/m-p/31650
Thanks again for your help
You can read PDF files and turn them into readable text with the PDF Table Extraction extension:
https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_pdf_table_extraction