[SOLVED] Crawl Web and generate reporting
pemguinkpl
Member Posts: 14 Contributor II
Hi,
I have tried the Crawl Web process, but the result shows that no documents were crawled. May I know what the problem is?
I followed exactly the steps from the video below, but still ran into the problem.
http://www.youtube.com/watch?v=zMyrw0HsREg
Any help please... :-\
Also, how do I use the Generate Report and Report operators in RapidMiner?
Does anyone know?
Thank you!
Answers
I didn't watch the video and don't have the time to. Could you please post your process and describe more specifically what you are trying to do?
Best regards,
Marius
My initial research is to analyze H1N1 news, using a crawler to get all the news about H1N1. This is the link I am trying to crawl:
http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00%2B08:00&max-results=20
but I can't get any documents.
This is my process XML:
<process version="5.1.014">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
<process expanded="true" height="386" width="547">
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
<parameter key="url" value="http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00+08:00&amp;max-results=20"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+suiteid.+"/>
<parameter key="follow_link_with_matching_url" value=".+pagenum.+|.+suiteid.+"/>
</list>
<parameter key="output_dir" value="D:\FYP\result\test\crawl"/>
<parameter key="extension" value="html"/>
<parameter key="max_depth" value="1"/>
<parameter key="delay" value="500"/>
<parameter key="max_threads" value="4"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.27 Safari/532.0"/>
</operator>
<operator activated="false" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="165">
<parameter key="repository_entry" value="../new"/>
</operator>
<operator activated="false" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="30">
<parameter key="create_word_vector" value="false"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<list key="specify_weights"/>
<process expanded="true" height="396" width="709">
<operator activated="false" class="text:extract_information" compatibility="5.1.004" expanded="true" height="60" name="Extract Information" width="90" x="113" y="89">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="h1n1" value="(h1n1\W+(?:\w+\W+){1,5}?influenzah1n1)"/>
<parameter key="influenza" value="(influenzah1n1\W+(?:\w+\W+){1,5}?)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="300">
<parameter key="repository_entry" value="../new"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
May I know what the problem is? Thanks
The problem is that the page you are trying to crawl does not allow itself to be crawled, and of course RapidMiner obeys this exclusion by default. The Crawl Web operator has two options to ignore the so-called robot exclusion, but as the documentation says, you are usually not allowed to disable it for pages that are not your own. These are the parameters:
obey robot exclusion: Specifies whether the crawler obeys the rules that determine which pages on a site may be visited by a robot. Disable this only if you know what you are doing and if you are sure you are not violating any existing laws by doing so. Range: boolean; default: true
really ignore exclusion: Do you really want to ignore the robot exclusion? This might be illegal. Range: boolean; default: false
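For reference, here is a minimal sketch of how those flags would look in the process XML, assuming the parameter keys are obey_robot_exclusion and really_ignore_exclusion (the keys are an assumption; check the operator's parameter panel in your version for the exact names):
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
  <parameter key="url" value="http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00+08:00&amp;max-results=20"/>
  <!-- Assumed parameter keys: disabling the robot exclusion is only appropriate
       for sites you own or have explicit permission to crawl. -->
  <parameter key="obey_robot_exclusion" value="false"/>
  <parameter key="really_ignore_exclusion" value="true"/>
  <!-- remaining parameters as in the original process -->
</operator>
You can see what the site actually disallows by opening http://my-h1n1.blogspot.com/robots.txt in a browser.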
Best,
Marius
Thank you for the replies, that solved my problem.