Extracting coloured parts of my document
Raphael2304
Hey all
I have a sample of approx. 15,000 press releases in which different parts are coloured. Is there a way to extract these parts into a table where every release is a row and the differently coloured parts are the columns?
Thanks a lot in advance and have a great weekend!
Best regards from Germany
Raphael
Answers
In the latter case, RapidMiner treats these as special attributes (like an ID or a label). You can work around this either by selecting 'include special attributes' in the operators that are sensitive to it, or by making them regular attributes with the Set Role operator.
If you mean colors in the source data, please share a sample so we can get a better understanding.
Thanks a lot in advance!
The normal Read Document operator will therefore not help: by default it strips all of the layout from the PDF and leaves you with the text only.
So this leaves patterns. If there is a designated word or sentence in the yellow part, it becomes relatively easy. At first glance it appears that all paragraphs containing the word Rorsted are marked, so there is your pattern.
-> Load your text and split it into paragraphs (basically, wherever there are one or more empty lines between blocks of text). If a paragraph contains Rorsted it was yellow, otherwise white. I've attached a basic example that comes close enough to give you the idea.
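For anyone who prefers to try the paragraph pattern outside RapidMiner, here is a minimal Python sketch of the same idea. It is not the attached process. The function names, the column names "marked" and "unmarked", and the sample texts are made up for illustration, and it assumes the keyword really does identify every coloured paragraph.

```python
import re

def split_paragraphs(text):
    """Split text wherever one or more empty lines separate blocks."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

def release_to_row(text, keyword="Rorsted"):
    """Sort paragraphs into 'marked' (contains keyword) and 'unmarked'."""
    marked, unmarked = [], []
    for para in split_paragraphs(text):
        (marked if keyword in para else unmarked).append(para)
    return {"marked": "\n\n".join(marked), "unmarked": "\n\n".join(unmarked)}

# Hypothetical sample data: one entry per press release.
releases = {
    "release_001": "Intro text.\n\nRorsted said something.\n\n\nClosing line.",
    "release_002": "Only plain text here.",
}

# One row per release, one column per paragraph type.
rows = [{"id": name, **release_to_row(text)} for name, text in releases.items()]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be written to CSV and read back into RapidMiner, so every release ends up as one row.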
If this is not the case and the marking is random, it becomes a whole lot more complex and you're left with two other options (to my awareness). Both are pretty advanced and require Python knowledge.
- Behind the scenes, a PDF describes your page as a set of boxes, each with a location and its part of the markup. Oversimplified, the code behind the document is something like 'box with x-y coordinates containing yellow overlay with text'. A PDF is not XML itself, but there are quite a few Python PDF-to-text packages that can expose this structure, pdfminer being one of them. It can convert the PDF to an XML format, and then you can use XPath to separate the colored areas from the non-colored areas. If you're familiar with these tools you can basically do whatever you want with PDFs, but this works best with continuous text as in your sample.
- Another option would be to use computer vision (like OpenCV), where you split your PDF into smaller PDFs based on the background. So if you have a PDF starting with a white background square, followed by for instance yellow, white, yellow, white and so on, you could split on these backgrounds and deal with each part separately.
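To make the 'box with coordinates under a yellow overlay' idea from the first option a bit more concrete, here is a rough Python sketch using only the standard library. Big assumption: the XML below is a simplified layout I invented for illustration. The tag names (rect, textbox) and the fill and bbox attributes are hypothetical, and the real output of pdfminer or any other converter will look different, so the XPath expressions would need adapting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout XML. Real converter output will differ.
SAMPLE_XML = """
<page>
  <rect bbox="50,700,550,760" fill="yellow"/>
  <textbox bbox="60,710,540,750">Quote from the CEO.</textbox>
  <textbox bbox="60,600,540,680">Plain background paragraph.</textbox>
</page>
"""

def parse_bbox(value):
    """Turn 'x0,y0,x1,y1' into a tuple of floats."""
    x0, y0, x1, y1 = (float(v) for v in value.split(","))
    return x0, y0, x1, y1

def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def split_by_highlight(xml_text, colour="yellow"):
    """Return (highlighted, plain) text lists based on overlapping rects."""
    root = ET.fromstring(xml_text)
    # XPath-style query for the coloured rectangles (assumed attribute name).
    marks = [parse_bbox(r.get("bbox"))
             for r in root.findall(f".//rect[@fill='{colour}']")]
    highlighted, plain = [], []
    for box in root.findall(".//textbox"):
        bbox = parse_bbox(box.get("bbox"))
        text = (box.text or "").strip()
        if any(overlaps(bbox, m) for m in marks):
            highlighted.append(text)
        else:
            plain.append(text)
    return highlighted, plain

highlighted, plain = split_by_highlight(SAMPLE_XML)
print("highlighted:", highlighted)
print("plain:", plain)
```

The overlap test is the important part: a text box counts as coloured when its bounding box intersects a coloured rectangle. How you get the rectangles and their colours out of a real PDF depends on the package you choose.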
Also check these resources. I have worked with them with good results:
Extract annotations and highlighted passages from PDF files - Steve Powell's blog (pogol.net)
How To: Extract Highlighted Text from a PDF File | francisco morales
GitHub - Samathy/pdfcommentextractor: Extracts highlighted text from PDF documents.
Regards