
UTF-8 encoded text doesn't come out right from the Get Page operator

s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II
edited December 2018 in Help
Dear RapidMiners,

I am having an issue with the Get Page operator and UTF-8 encoding.

I am scraping the content of this web page:


According to the html code I get out of Get Page, this page uses UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The problem is that, for example, the apostrophe in "FDA’s" comes out as garbled characters.
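Garbling of this kind is typical when the UTF-8 bytes of a page are decoded with a single-byte code page. The following is a minimal Python sketch of that mechanism, not of what RapidMiner does internally. It assumes Windows-1252 (`cp1252`) as the wrong code page, which the thread does not confirm, and `misdecode` is a hypothetical helper name.

```python
def misdecode(text: str, wrong: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong code page.

    'cp1252' is an assumption; any single-byte code page garbles in a similar way.
    """
    return text.encode("utf-8").decode(wrong)


original = "FDA\u2019s"        # FDA’s, with a typographic apostrophe (U+2019)
garbled = misdecode(original)  # the UTF-8 bytes E2 80 99 turn into three characters

# ascii() keeps the output readable on consoles that cannot print the characters.
print(ascii(original))  # 'FDA\u2019s'
print(ascii(garbled))   # 'FDA\xe2\u20ac\u2122s', which displays as FDAâ€™s
```

One apostrophe becomes three characters because each of its three UTF-8 bytes is read as a separate Windows-1252 character.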

I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:

"Encoding 'SYSTEM' is not supported"

Any idea how to solve this (without having to manually search and replace the unwanted characters, please)?

Many thanks in advance for any kind of input!

Snežana

Answers

  • kayman Member Posts: 662 Unicorn
    edited December 2018
    Is your process itself also using UTF-8?
    When you click into your main process window, you can also define the encoding for the process itself in the parameters. I typically set this to UTF-8 as well, and do the same under settings -> preferences -> general -> encoding.
  • s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II
    Dear @kayman ,

    Many thanks for your suggestion! However, it didn't really help resolve my case :-(

    I am not sure whether I am doing things right, but I adjusted the settings as you suggested and reran the process, and got the same result as before. I also tried restarting RapidMiner after adjusting the settings, but nothing changed. I am not exactly sure where the problem is, but no matter which encoding setting I choose (I tried SYSTEM, UTF-8 and ISO-8859-1 for fun), I get the same output.

    In any case, what I see straight out of Get Page is different from what I see in the final Example Set. Here is an example:

    After Get Page:
    CVM GFI #108 Registering with CVM’€™€™\200\203€™s Electronic Submission System

    In the final Example Set:
    Registering with CVM’s Electronic Submission System

    Any idea what is still wrong?

    Many thanks in advance for any kind of input!

    Snezana
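
If the encoding settings cannot be made to work, the text can in principle be repaired afterwards without a manual search and replace. The sketch below is written in Python rather than as a RapidMiner process. It assumes the wrong code page was Windows-1252 (`cp1252`), which the thread does not confirm, and `repair_mojibake` is a hypothetical helper. Because the Get Page output above looks garbled more than once, the repair is retried for a few rounds.

```python
def repair_mojibake(text: str, wrong: str = "cp1252", max_rounds: int = 3) -> str:
    """Undo UTF-8 text that was decoded as `wrong`, possibly several times over.

    Assumption: `wrong` is the code page that caused the garbling.
    """
    for _ in range(max_rounds):
        try:
            # Reverse one wrong decode: recover the bytes, then read them as UTF-8.
            fixed = text.encode(wrong).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text is not (or is no longer) reversible mojibake
        if fixed == text:
            break  # nothing changed, e.g. plain ASCII
        text = fixed
    return text


garbled = "FDA\u00e2\u20ac\u2122s"      # displays as FDAâ€™s
print(ascii(repair_mojibake(garbled)))  # 'FDA\u2019s', i.e. FDA’s
```

This only works while the garbled characters still map back to the original bytes. Text in which bytes were dropped or replaced cannot be recovered this way, and legitimate text that happens to look like mojibake could be altered, so decoding the page bytes as UTF-8 at fetch time remains the cleaner fix.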



