PDF encoding issue

limegreenman900 · July 2016

Hi everyone,

I was trying to do the simplest thing one can do: read a PDF file into RapidMiner (RM). I have done this several times before, but now I am stuck with what I suspect is an encoding issue.

After the "Read Document" operator (with "extract text only" and "use file extension as type" both checked), I inserted a breakpoint before preprocessing the text. However, I don't get any text out of my PDF; what I get instead is something like:

¨ÉøC&13#s$ó/Y¢¬–¬³ÙÜìâì=ÙOsbsúrnåºçsOæ1óŠòvç=Ë�ËïÏŸ\ä»hÙ¢óÖê‚#…¤Â¼Â�…³‹ãoZ<]TÔUt}‰`IÃ’sK—V-ý¤˜Y,+>TB(É/ÙSòƒ,]6*›-•–¾W:#—È7Ë*¢ŠÊe¿ò^YDYÙ}U„j£êAyTù`ù#µD=¬þ¶"©b{Å³ÊôÊ+¬Ê¯: !kJ4Gµm¥ötµ}uCõ%�—®K7YV³©fFŸ¢ßYÕ.©=bàá?SŒîÆ•Æ©ºÈº‘ºçõyõ‡Ø
Ú†�ž�kï5%4ý¦m–7Ÿlqlio™Z³lG+ÔZÚz²Í¹³mzyâò]íÔöÊö?uøuôw|¿"Å±N»Îå�wW&®ÜÛeÖ¥ïº±*|ÕöÕèjõê‰5k¶¬yÝèþ¢Ç¯g°ç‡^yïkEk‡Öþ¸®lÝD_pß¶õÄõÚõ×7DmØÕÏîoê¿»1mãál {àûMÅ›Î
nßLÝlÜ<9”úO

Does anyone have an idea where the problem is? My guess is that it is an encoding issue.

If I open the PDF and copy and paste the text into a Word file, there is no problem and the text is displayed correctly.

Thomas_Ott · July 2016

You can change the encoding on the Read Document operator. Just enable the advanced settings and a new parameter box will show up in the parameter window. From there you can change the encoding.

limegreenman900 · July 2016

I am working with RM 5.3, and the encoding of the "Read Document" operator is set to "System" by default. That should automatically match the correct encoding, right?

MartinLiebig · July 2016

Hi,

Usually it does. If you have a UTF-8 file on a Windows machine, though, it might not work. So I would give it a try with UTF-8.

~Martin
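
To illustrate the point about "System" encoding on Windows, here is a minimal Python sketch (outside RapidMiner) of what happens when UTF-8 bytes are decoded with a Windows-style default. It assumes the system encoding is cp1252, which is common on Western European Windows installs but not guaranteed; the sample string and the `decode_with` helper are made up for the illustration.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding, replacing undecodable bytes."""
    return raw.decode(encoding, errors="replace")


# Hypothetical sample text containing non-ASCII characters.
original = "Grüße aus Köln"
raw = original.encode("utf-8")  # the bytes as they would be stored in a UTF-8 file

# Assumption: "System" resolves to cp1252 on the Windows machine.
as_system = decode_with(raw, "cp1252")
as_utf8 = decode_with(raw, "utf-8")

print(as_system == original)  # False: multi-byte characters turn into mojibake
print(as_utf8 == original)    # True: the matching encoding recovers the text
```

A mismatch of this kind typically garbles only the non-ASCII characters and leaves plain ASCII letters readable. If the whole output is unreadable, as in the sample posted above, switching the encoding parameter may not be enough on its own.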

limegreenman900 · August 2016

@mschmitz: I gave it a try with UTF-8, but it didn't work. I'll figure out another way; somehow it has to work.

Nevertheless, thanks for your help.
