Problem with the extension operator "Get Pages"
Hi,
I have a problem with the "Get Pages" operator from the Web Mining Extension.
It seems the operator has an encoding problem with non-ASCII UTF-8 characters such as "Ü".
With Mozilla Firefox I get a JSON response with results when calling the URL "https://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5".
When I call the same URL via the "Get Pages" operator, I get a JSON response, but it contains no search results.
This is my test process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.5.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
<parameter key="target_function" value="random"/>
<parameter key="number_examples" value="1"/>
<parameter key="number_of_attributes" value="1"/>
<parameter key="attributes_lower_bound" value="-10.0"/>
<parameter key="attributes_upper_bound" value="10.0"/>
<parameter key="gaussian_standard_deviation" value="10.0"/>
<parameter key="largest_radius" value="10.0"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.5.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
<list key="function_descriptions">
<parameter key="att1" value="&quot;https://itunes.apple.com/search?term=\&quot;Google Übersetzer\&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5&quot;"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPage" width="90" x="313" y="34">
<parameter key="link_attribute" value="att1"/>
<parameter key="page_attribute" value="html"/>
<parameter key="random_user_agent" value="false"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
<parameter key="connection_timeout" value="2000"/>
<parameter key="read_timeout" value="2000"/>
<parameter key="follow_redirects" value="true"/>
<parameter key="accept_cookies" value="none"/>
<parameter key="cookie_scope" value="global"/>
<parameter key="request_method" value="POST"/>
<parameter key="delay" value="random"/>
<parameter key="delay_amount" value="5000"/>
<parameter key="min_delay_amount" value="2000"/>
<parameter key="max_delay_amount" value="5000"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="getPage" to_port="Example Set"/>
<connect from_op="getPage" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Can you reproduce the issue? Do you think this is a bug in the operator, or do I have to escape the URL, and if so, how?
Regards
Johannes
Best Answer
Edin_Klapic Employee-RapidMiner, RMResearcher, Member Posts: 299 RM Data Scientist
The link needs to be encoded as follows:
https://itunes.apple.com/search?term="Google+%C3%9Cbersetzer"&entity=software&country=de&media=software&limit=5
My first suggestion, %DC, as the encoding for the letter Ü is only partly correct: %DC is the ISO-8859-1 (Latin-1) byte for Ü. For UTF-8 it needs to be %C3%9C.
You can test URL encoding on various websites (e.g. here).
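As an alternative to an online tool, the same encoding can be checked locally. The following is a minimal sketch in Python (not part of the RapidMiner process), using the standard library's `urllib.parse.quote_plus`. The variable names are illustrative, and the literal double quotes around the search term are kept only to match the URL given above.

```python
from urllib.parse import quote_plus

# Search term from the original question.
term = "Google Übersetzer"

# quote_plus encodes as UTF-8 by default: "Ü" becomes the two bytes
# C3 9C, and the space becomes "+".
utf8_term = quote_plus(term)

# With Latin-1 (ISO-8859-1), "Ü" is the single byte DC. This is the
# "partly correct" variant mentioned above.
latin1_term = quote_plus(term, encoding="latin-1")

# Rebuild the request URL with the UTF-8 encoded term. The literal
# double quotes mirror the URL in this thread.
url = (
    f'https://itunes.apple.com/search?term="{utf8_term}"'
    "&entity=software&country=de&media=software&limit=5"
)

print(utf8_term)
print(latin1_term)
print(url)
```

The resulting string can then be pasted into the attribute that "Get Pages" reads its link from.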
Best,
Edin
Answers
It gives me a Bad Request (400) if I just plug the URL into a single Get Page operator. I think Apple is preventing people like us from using their service. Maybe @Edin_Klapic has an idea about this.
Hi Johannes,
I tried your URL with various RapidMiner operators: Get Pages, Get Page, Enrich Data by Webservice, and Open File (from URL) in combination with Read Document.
None of them delivered the desired output. But I can confirm that I got the same result you did.
Regarding your Encoding question:
For your use case I tried encoding the part you mentioned, but this did not help.
When I load the URL in my browser, a .txt file is downloaded to my computer. I suspect that this is where the problem lies.
If you can try this with a website that returns only a JSON string as the result, we should be able to get this going.
Best regards,
Edin
Hi,
Thanks a lot for your work!
I'm sorry for the late response. There was a mistake in my process: the user agent must be randomized. The following process shows my problem better.
You can see that the process works for the second row but not for the first row, which contains the special character. So I still think this is an encoding problem in the implementation of the "Get Pages" operator.
Best Regards,
Johannes
Thanks a lot. The solution is working!