[SOLVED] Web Mining: Crawl Web works or not - depending on site, bug or feature?
Are there known bugs in the Web Mining "Crawl Web" operator? I have noticed several forum threads on the web asking the same question, but with no answers.
I tested with RapidMiner version 5.3.013 and the latest Web Mining extension. The two processes below use the same logic on two different sites; one works and the other does not.
1. This works:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://uta.fi"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*tutkimus.*"/>
          <parameter key="follow_link_with_matching_url" value=".*tutkimus.*"/>
        </list>
        <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
2. But this does not, although the logic is the same. I wonder why? Is there any way to see in more detail, step by step, what the operator is doing when it parses the page, so that I could find the reason myself?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
          <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
        </list>
        <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
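To rule out the regular expressions themselves, a quick check outside RapidMiner might look like the sketch below. It rests on an assumption I have not confirmed for the Crawl Web operator: that a rule must match the whole URL (as Java's `String.matches` would). The helper name `matches_rule` is made up for this example; the patterns and start URLs are copied from the two processes.

```python
import re

# Crawling rules copied from the two processes (same pattern is used
# for both store_with_matching_url and follow_link_with_matching_url).
RULES = {
    "uta": r".*tutkimus.*",
    "kaksplus": r".*keskustelu.*",
}

# Start URLs copied from the "url" parameter of each process.
START_URLS = {
    "uta": "http://uta.fi",
    "kaksplus": "http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt",
}


def matches_rule(pattern: str, url: str) -> bool:
    """Return True if the entire URL matches the pattern.

    Assumption: full-match semantics; the operator's real behaviour
    is not confirmed here.
    """
    return re.fullmatch(pattern, url) is not None


for site, url in START_URLS.items():
    print(site, url, matches_rule(RULES[site], url))
```

Under that assumption, the start URL of the failing site matches its own rule, while the start URL of the working site does not match its rule at all. So the patterns alone do not seem to explain the difference.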
Is RapidMiner's "Crawl Web" operator generally reliable, or should I use some other software to crawl fairly large forum sites and use RapidMiner only for mining the crawled files?
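If I did the crawling outside RapidMiner, the link-following step could be inspected one page at a time with something like the sketch below. It uses only the Python standard library and is not how Crawl Web works internally. The names `LinkCollector` and `links_to_follow` and the sample HTML are invented for illustration, and full-match regex semantics are again an assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def links_to_follow(html, base_url, follow_pattern):
    """Return the links in html whose full URL matches follow_pattern."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if re.fullmatch(follow_pattern, u)]


# Made-up page content, only to show the filtering step.
sample_html = """
<a href="/keskustelu/plussalaiset/toinen-aihe">thread</a>
<a href="/blogit/">blogs</a>
<a href="javascript:void(0)">next</a>
"""

base = "http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"
for url in links_to_follow(sample_html, base, r".*keskustelu.*"):
    print(url)
```

Running this against a saved copy of the real page would show whether the page exposes any plain `<a href>` links that match the rule, or whether the navigation is built in some other way, for example by scripts.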