"Web Crawler problem"

mmarag · April 2011

Hi all,

i am phasing a serious bug when using the web crawler or the process documents from web processes. I am attempting to run a simple opinion mining experiment on http://www.opengov.gr/ web site, which according to the robots.txt file allows every agent freely.

Howeever, nothing happens and there is nothing in my log as well. I did not use any rule for your information. Kind regards

mmarag

haddock · April 2011

Hi there Mmarag,

For the future, if you paste the XML of your process it makes it easier to check, for the present the following code appears to work, so I ponder where the "serious bug" really lies.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true" height="454" width="812">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="111" y="242">
        <parameter key="url" value="http://www.opengov.gr/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".*gr.*"/>
          <parameter key="store_with_matching_url" value=".*gr.*"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="output_dir" value="C:\Documents and Settings\Administrator.KNOWLEDG-P6715Y\My Documents"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="62" y="117">
        <parameter key="url" value="http://www.opengov.gr/home/"/>
        <list key="query_parameters"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <connect from_op="Get Page" from_port="output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

mmarag · April 2011

Dear Sir,

thank you very much for the rapid response.

Mmarag

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Web Crawler problem"

Answers