[SOLVED] Crawl Web and generate reporting
pemguinkpl
Member Posts: 14 Contributor II
Hi,
I have tried the Crawl Web process, but the result shows that no documents were crawled. May I know what the problem is?
I followed exactly the steps from the video below, but still ran into the problem.
http://www.youtube.com/watch?v=zMyrw0HsREg
Any help please... :-\
Also, how do I use the Generate Report and Report operators in RapidMiner?
Does anyone know?
Thank you!
Answers
I didn't watch the video and don't have the time to. Could you please post your process and describe more specifically what you are trying to do?
Best regards,
Marius
My initial research is to analyze H1N1 news, using a crawler to get all the news about H1N1. This is the link I am trying to crawl:
http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00%2B08:00&max-results=20
but I can't get any documents.
This is my process XML:
<process version="5.1.014">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
<process expanded="true" height="386" width="547">
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
<parameter key="url" value="http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00+08:00&amp;max-results=20"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+suiteid.+"/>
<parameter key="follow_link_with_matching_url" value=".+pagenum.+|.+suiteid.+"/>
</list>
<parameter key="output_dir" value="D:\FYP\result\test\crawl"/>
<parameter key="extension" value="html"/>
<parameter key="max_depth" value="1"/>
<parameter key="delay" value="500"/>
<parameter key="max_threads" value="4"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.27 Safari/532.0"/>
</operator>
<operator activated="false" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="165">
<parameter key="repository_entry" value="../new"/>
</operator>
<operator activated="false" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="30">
<parameter key="create_word_vector" value="false"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<list key="specify_weights"/>
<process expanded="true" height="396" width="709">
<operator activated="false" class="text:extract_information" compatibility="5.1.004" expanded="true" height="60" name="Extract Information" width="90" x="113" y="89">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="h1n1" value="(h1n1\W+(?:\w+\W+){1,5}?influenzah1n1)"/>
<parameter key="influenza" value="(influenzah1n1\W+(?:\w+\W+){1,5}?)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="300">
<parameter key="repository_entry" value="../new"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
May I know what the problem is? Thanks
The problem is that the page you are trying to crawl does not allow itself to be crawled, and of course RapidMiner obeys this exclusion by default. The Crawl Web operator has two options to ignore the so-called robot exclusion, but as the documentation says, you are usually not allowed to disable it for pages that are not your own. These are the parameters:
obey robot exclusion: Specifies whether the crawler obeys the rules that determine which pages on a site may be visited by a robot. Disable this only if you know what you are doing and if you are sure you are not violating any existing laws by doing so. Range: boolean; default: true
really ignore exclusion: Do you really want to ignore the robot exclusion? This might be illegal. Range: boolean; default: false
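For reference, here is a minimal sketch of how those flags would look in the process XML, assuming the parameter keys are obey_robot_exclusion and really_ignore_exclusion (the keys are an assumption; check the operator's parameter panel in your version for the exact names):
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
  <parameter key="url" value="http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00+08:00&amp;max-results=20"/>
  <!-- Assumed parameter keys: disabling the robot exclusion is only appropriate
       for sites you own or have explicit permission to crawl. -->
  <parameter key="obey_robot_exclusion" value="false"/>
  <parameter key="really_ignore_exclusion" value="true"/>
  <!-- remaining parameters as in the original process -->
</operator>
You can see what the site actually disallows by opening http://my-h1n1.blogspot.com/robots.txt in a browser.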
Best,
Marius
Thank you for the replies, that solved my problem.