PDF encoding issue

limegreenman900 Member Posts: 26 Contributor II
edited November 2018 in Help

Hi everyone,

 

I was trying to do the simplest thing possible: reading a PDF file into RM. I have done this several times before, but now I am stuck with what I suspect is an encoding issue.

After the "Read Document" operator (with "extract text only" and "use file extension as type" both checked), I inserted a breakpoint before doing any preprocessing of the text. However, I don't get any text out of my PDF. What I get instead is something like:


¨ÉøC&13#s$ó/Y¢¬–¬³ÙÜìâì=ÙOsbsúrnåºç&#26;sOæ1óŠòvç=Ë�ËïÏŸ\ä»hÙ¢ó&#5;Ö&#5;ê‚#…¤Â¼Â�…³‹ã&#23;oZ<]&#20;TÔUt}‰`IÃ’sK­—V-ý¤˜Y,+>TB(É/ÙSòƒ,]6*›-•–¾W:#—È7Ë&#31;*¢&#21;&#3;Š&#7;Ê&#8;e¿ò^YDY&#127;Ù}U„j£êAyTù`ù#µD=¬þ¶"©b{ųÊôÊ&#15;+&#127;¬Ê¯: !kJ4Gµ&#28;m¥ötµ}uCõ%�—®K7Y&#19;V³©fFŸ¢ßY&#11;Õ.©=bàá?S&#23;ŒîÆ•Æ©ºÈº‘ºçõyõ‡&#26;Ø
Ú†&#11;�ž�k&#26;ï5%4ý¦&#25;m–7Ÿlqlio™Z&#22;³lG+ÔZÚz²Í¹­³mzyâò]íÔöÊö?uøuôw|¿"&#127;űN»Îå�wW&®ÜÛe֥ﺱ*|ÕöÕèjõê‰5&#1;k¶¬yÝ­èþ¢Ç¯g°ç‡^yï&#23;kEk‡Öþ¸®lÝD_p߶õÄõÚõ×7DmØÕÏîoê¿»1mãá&#1;l {àûMÅ›Î
&#6;&#14;nßLÝlÜ<9”úO

Does anyone have an idea where the problem is? I suspect it is an encoding issue.

 

If I open the PDF file and copy and paste the text into a Word file, there is no problem and the text is displayed correctly.
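One way to narrow this down is to check whether the garbled output looks like mis-decoded text or like raw binary data. The sketch below rests on an assumption not confirmed in this thread: text decoded with the wrong character set usually shows wrong letters but few control characters, whereas raw binary content (such as a compressed stream) tends to contain many. The helper name `control_char_ratio` and the sample strings are hypothetical and only for illustration.

```python
def control_char_ratio(text: str) -> float:
    """Return the fraction of characters that are control characters,
    not counting ordinary whitespace (tab, newline, carriage return)."""
    if not text:
        return 0.0
    allowed = {"\t", "\n", "\r"}
    controls = sum(
        1 for ch in text
        if (ord(ch) < 32 and ch not in allowed) or ord(ch) == 127
    )
    return controls / len(text)


# Mis-decoded text: UTF-8 bytes read as cp1252 (wrong letters, no control characters).
mojibake = "Größe".encode("utf-8").decode("cp1252")

# Binary-like data: every byte value once, mapped one-to-one to characters via latin-1.
binary_like = bytes(range(256)).decode("latin-1")

print(control_char_ratio(mojibake))
print(control_char_ratio(binary_like))
```

If a sample of the output scores clearly above zero, changing the encoding parameter alone may not be enough to recover the text.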

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You can change the encoding on the "Read Document" operator. Enable the advanced settings and an additional encoding parameter will show up in the parameter window, where you can change it.

  • limegreenman900 Member Posts: 26 Contributor II

    I am working with RM 5.3, where the encoding of the "Read Document" operator is set to "System" by default. This should automatically match the correct encoding, right?

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    Usually it does. If you have a UTF-encoded file on a Windows machine, though, the system default might not work, so I would give it a try with UTF-8.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • limegreenman900 Member Posts: 26 Contributor II

    @mschmitz: I gave it a try with UTF, but it didn't work. I'll figure out another way; somehow it has to work.

    Nevertheless, thanks for your help.
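The point made in the answers about a UTF file on a Windows machine can be illustrated with a minimal sketch. It assumes the "System" default on Windows corresponds to cp1252, which is common for Western European locales but not guaranteed, and the sample string is made up for the example. It is not RapidMiner code; it only shows the general effect of decoding UTF-8 bytes with a different character set.

```python
# Text with non-ASCII characters, encoded as UTF-8 bytes (as a file on disk would be).
original = "Größe: 5 µm, café"
raw_bytes = original.encode("utf-8")

# Decoding with an assumed Windows system default (cp1252) garbles the non-ASCII characters.
as_system_default = raw_bytes.decode("cp1252")

# Decoding with the encoding the bytes were written in restores the text.
as_utf8 = raw_bytes.decode("utf-8")

print(as_system_default)
print(as_utf8)
```

A mismatch of this kind garbles only the non-ASCII characters and leaves plain ASCII letters readable, which is one way to tell it apart from other causes of unreadable output.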
