Crawl Web user agent is always Java/1.7.0_11

n0n0 · May 2013

Hello to the community.

When I try to crawl a website, I always get the mobile version of the site.
I thought this might be an issue with the user agent, so I changed the user agent parameter in the Crawl Web operator, but I still get the same result.
I then tried to crawl the page http://whatsmyuseragent.com/ with the following process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="false"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
        <parameter key="url" value="http://whatsmyuseragent.com/"/>
        <list key="crawling_rules"/>
        <parameter key="write_pages_into_files" value="true"/>
        <parameter key="add_pages_as_attribute" value="false"/>
        <parameter key="output_dir" value="C:\temp"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_depth" value="0"/>
        <parameter key="domain" value="web"/>
        <parameter key="delay" value="1000"/>
        <parameter key="max_threads" value="1"/>
        <parameter key="max_page_size" value="100"/>
        <parameter key="user_agent" value="rapid-miner-crawler"/>
        <parameter key="obey_robot_exclusion" value="true"/>
        <parameter key="really_ignore_exclusion" value="false"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The resulting C:\temp\0.html file reports my user agent as Java/1.7.0_11, no matter what I set in the user agent field.

I'm on a Windows 8 x64 machine, using RapidMiner 5.3.008 and Web Mining Extension 5.3.0.

Any advice?
Thank you
n0n0

Skirzynski · May 2013

Hey,

This seems to be a bug. The user agent setting works for the "Get Page" operator, but not for "Crawl Web" or "Process Documents from Web". I have created a ticket for this, and we will come back to this thread once it is fixed.

Thank you for reporting
Marcin
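
For anyone wondering where the string Java/1.7.0_11 comes from: Java's built-in HttpURLConnection sends "Java/" followed by the JVM version as its User-Agent whenever the caller does not set that header. The sketch below illustrates this with plain Java and a throwaway local server. It is not RapidMiner code, and it is only an assumption that Crawl Web goes through HttpURLConnection internally. The class name UserAgentDemo and the method reportedAgent are made up for this illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not part of RapidMiner or the Web Mining Extension.
class UserAgentDemo {

    /**
     * Sends one GET request to a temporary local server and returns the
     * User-Agent header that the server received. If userAgent is null,
     * no header is set, so the JVM default is sent.
     */
    static String reportedAgent(String userAgent) throws IOException {
        AtomicReference<String> seen = new AtomicReference<>();

        // Port 0 lets the OS pick a free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            seen.set(exchange.getRequestHeaders().getFirst("User-Agent"));
            exchange.sendResponseHeaders(200, -1); // 200 with no response body
            exchange.close();
        });
        server.start();

        try {
            URL url = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            if (userAgent != null) {
                // Explicit header replaces the JVM default.
                conn.setRequestProperty("User-Agent", userAgent);
            }
            conn.getResponseCode(); // triggers the actual request
            conn.disconnect();
        } finally {
            server.stop(0);
        }
        return seen.get();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("default: " + reportedAgent(null));
        System.out.println("custom:  " + reportedAgent("rapid-miner-crawler"));
    }
}
```

If the first line printed looks like Java/<your JVM version> and the second shows rapid-miner-crawler, that matches the behaviour described in this thread: the configured value would only reach the server if the operator actually passes it on as a request header.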
