How to do OCR with images
Hello all,
I want to do OCR on images in order to get text.
I have already installed the Image Miner, Text Processing, Aylien Text Analysis, and Feature Selection extensions.
After I use the "Open Color Image" or "Open Gray-scale Image" operator, which operator should I use to recognize and extract the text and features from the image file?
Any help will be appreciated.
Thanks,
Answers
Hi. I'm new here, and hopefully this is the right place to ask my question.
So what's the matter? I have a highly aggregated PDF file containing images, tables, and metadata as text. Doing it manually, I would use "save as tables" on the selected tables within the PDF file, "save as image" on the selected images, and "save as CSV" on the rest. Everything with the Adobe ...
After some cleaning I would have three separate files: xlsx, csv, and jpeg. The first two can be read natively in RM; for the third, I suppose an extension would work fine.
How can I do this better? The mentioned elements always appear at the same position. Hortonworks and Zapier could be integrated into the process.
I'm looking forward to your contributions. Thank you in advance.
Lukas
The mentioned file looks like the pictures above.
Hello,
You need OCR for this, but even if you are able to convert the image to text, it will be a big problem to recover the structure of the text and extract specific values. You can use Tesseract for OCR. If you are satisfied with the output, you can order custom development of an extension for RapidMiner.
Best wishes
Vaclav
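The Tesseract suggestion above, combined with the fact that the elements always appear at the same position, could be sketched roughly as follows. This is a minimal sketch and not a RapidMiner operator: it assumes the Tesseract command-line tool (version 4 or later, for the `--psm` flag) is installed and on the PATH, and the helper names `build_tesseract_command`, `run_ocr`, and `words_in_region` are hypothetical names chosen for illustration. It asks Tesseract for TSV output, which lists one row per recognized word with its bounding box, and then keeps only the words inside a fixed pixel rectangle.

```python
import csv
import io
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=6):
    """Build the argument list for the Tesseract command-line tool.

    "stdout" makes Tesseract print the result instead of writing a file;
    the trailing "tsv" config requests one row per word with its bounding box.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm), "tsv"]


def run_ocr(image_path, lang="eng", psm=6):
    """Run Tesseract (assumed to be installed) and return its TSV output as text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


def words_in_region(tsv_text, left, top, right, bottom):
    """Return the recognized words whose boxes lie fully inside the rectangle."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # page, block, and line rows carry no text
        x, y = int(row["left"]), int(row["top"])
        w, h = int(row["width"]), int(row["height"])
        if x >= left and y >= top and x + w <= right and y + h <= bottom:
            words.append(text)
    return words
```

With an image exported from the PDF (for example a hypothetical `page1.png`), `words_in_region(run_ocr("page1.png"), 0, 0, 200, 100)` would return the words found in the top-left 200 by 100 pixel area. The rectangle coordinates have to be measured once per document layout, and OCR quality will still depend on the scan.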
Hi,
Somehow I didn't notice your reply. Sorry. And then, as always, there was the lack of time.
I'll take a look at Tesseract.
Thank you.
Best wishes
Lukas
There is a new extension on the Marketplace that RM just developed called PDF Table Extraction. Maybe that can help?
https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_pdf_table_extraction
Hi Thomas,
Wow, that sounds good. I'm excited. Thank you very much.
I will leave feedback after trying it out.
Best wishes.
Lukas