[Solved] Web Crawler Operator: Empty folder and results
Kate_Strydom
Member Posts: 19 Contributor II
Hi,
I have followed all the instructions from http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html, but my web crawler output folder is empty. What am I doing wrong? The system times out at 42s. Has anyone had this problem after changing the crawling rules to .+auburnbigdata.+?
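As a quick sanity check on the crawling rules themselves, the pattern .+auburnbigdata.+ can be tested outside RapidMiner. This is a minimal sketch in Python, assuming the operator applies the rule to full URLs; the example URLs are taken from the post and from typical Blogspot post paths:

```python
import re

# Both the follow_link_with_matching_url and store_with_matching_url
# rules in the process XML use the same pattern.
pattern = re.compile(r".+auburnbigdata.+")

urls = [
    "http://auburnbigdata.blogspot.com",  # the start URL
    "http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html",
]

# Each .+ requires at least one character, so "http://" satisfies the
# leading part and ".blogspot.com..." satisfies the trailing part.
for url in urls:
    print(url, "->", bool(pattern.match(url)))
```

Both URLs match, so the regex itself is unlikely to be why the folder is empty; the timeout points more toward the connection or operator settings.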
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="447" y="75">
        <parameter key="url" value="http://auburnbigdata.blogspot.com"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".+auburnbigdata.+"/>
          <parameter key="store_with_matching_url" value=".+auburnbigdata.+"/>
        </list>
        <parameter key="output_dir" value="C:\Users\cec045\Desktop\CrawlData"/>
        <parameter key="max_depth" value="10"/>
        <parameter key="max_threads" value="2"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Answers
We do not really know exactly what happened, but it now works on our virtual machine setup, although there still seems to be a problem with RM on my PC.
An SA RM user suggested that we change the default max page size to 500.
Our server expert experimented further, and we then changed max threads to 4. Perhaps the crawler operator needs more threads; my PC is limited to 2 threads.
We then tested it on a different website, and I cannot wait to continue learning the text processing part of RM.
I also noticed that leaving max pages blank means the crawler pulls everything, so we first tested with max pages set to 20.
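For reference, the working settings described above would correspond to a Crawl Web operator something like the following. This is a sketch, not the exact process we ran: the max_pages and max_page_size key names are assumptions about the Web Mining extension's parameter keys, so check them against the XML your own RM version generates.

```xml
<operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="447" y="75">
  <parameter key="url" value="http://auburnbigdata.blogspot.com"/>
  <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value=".+auburnbigdata.+"/>
    <parameter key="store_with_matching_url" value=".+auburnbigdata.+"/>
  </list>
  <parameter key="output_dir" value="C:\Users\cec045\Desktop\CrawlData"/>
  <!-- start small so a misconfiguration shows up before a full crawl -->
  <parameter key="max_pages" value="20"/>
  <parameter key="max_depth" value="10"/>
  <!-- raised from 2 to 4, which is what got it working on our setup -->
  <parameter key="max_threads" value="4"/>
  <!-- raised from the default to 500 as suggested above -->
  <parameter key="max_page_size" value="500"/>
  <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"/>
</operator>
```

Once the small crawl succeeds, max_pages can be left blank again to pull everything.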