UTF-8 encoded text doesn't come out right from the Get Page operator
s_nektarijevic
RapidMiner Certified Analyst, Member Posts: 12 Contributor II
Dear RapidMiners,
I am having an issue with the Get Page operator and UTF-8 encoding.
I am scraping the content of this web page:
According to the HTML I get out of Get Page, this page uses UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The problem is that, for example, FDA’s comes out as FDAâs.
I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:
"Encoding 'SYSTEM' is not supported"
Any idea how to solve this (without having to manually search and replace the unwanted characters, please)?
Many thanks in advance for any kind of input!
Snežana
Best Answer
Marco_Boeck Administrator, Moderator, Employee-RapidMiner, Member, University Professor Posts: 1,996 RM Engineering
Hi,
This works just fine for me:
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
        <parameter key="url" value="https://www.fda.gov/RegulatoryInformation/Guidances/default.htm"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
        <parameter key="override_encoding" value="true"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Answers
When you click into the main process window, you can also set the encoding for the process itself in the parameters. I typically set this to UTF-8 as well, and do the same under Settings -> Preferences -> General -> Encoding.
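For anyone wondering why FDA’s ends up as FDAâs in the first place: that pattern is what you would expect when the UTF-8 bytes of a page are decoded with a single-byte encoding. Here is a minimal Python sketch of the effect, outside RapidMiner. It assumes ISO-8859-1 (latin-1) as the wrongly applied encoding, which is only a guess and not something confirmed in this thread, and `decode_page` is a hypothetical helper for illustration, not part of any RapidMiner API.

```python
# Minimal sketch: the same bytes decoded with the wrong and the right encoding.
# Assumption: the mismatched decoder is ISO-8859-1 (latin-1); the thread does
# not say which encoding "SYSTEM" resolved to on the poster's machine.

def decode_page(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode fetched bytes with an explicit encoding."""
    return data.decode(encoding)

# Bytes a UTF-8 page would contain for "FDA’s" (U+2019 is the curly apostrophe).
raw = "FDA\u2019s".encode("utf-8")

# Wrong: each of the three apostrophe bytes becomes its own character.
garbled = decode_page(raw, "latin-1")

# Two of those characters are invisible control characters, so only "â" shows.
visible = "".join(ch for ch in garbled if ch.isprintable())

# Right: decoding as UTF-8 restores the original text.
correct = decode_page(raw, "utf-8")

print(repr(garbled))
print(visible)
print(correct)
```

The curly apostrophe takes three bytes in UTF-8, and a single-byte decoder maps each byte to a separate character, which is why forcing UTF-8 at decode time (as with "override encoding" set to UTF-8 in Get Page) avoids any search-and-replace afterwards.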