Web crawling only works for some sites
I'm trying to get data from various websites (mostly ones with ads), so I'm trying out RapidMiner's web crawler. For practice I've successfully downloaded pages from wikipedia.org, google.com and a few others, but there seem to be many sites from which I can't get any data at all. For example, I can't get RapidMiner to crawl gumtree.com/property-for-sale. I know web crawling is frowned upon by many site owners, despite my good intentions, so I first suspected robot exclusion rules, but as you can see in the process below, that was not the problem. I also changed the user agent to "Firefox" and played around with the other parameters.

When the crawl works, it takes seconds or minutes to finish and generates a set of neatly arranged txt files in the specified folder. When it doesn't work, I don't get an error, just a message saying that "New results were created"; the process finishes in 0 seconds and no files appear anywhere. Why doesn't the web crawler work for some sites (the ones with the juicy data), and how can I make it work?

Thanks!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="145" width="212">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.gumtree.com/property-for-sale"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+t.+"/>
          <parameter key="follow_link_with_matching_url" value=".+t.+"/>
        </list>
        <parameter key="output_dir" value="/home/aqil/RapidMiner/rapidminer/repository/Test"/>
        <parameter key="max_depth" value="1"/>
        <parameter key="user_agent" value="Firefox"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Answers
I had the same problem, and it turned out to be a size issue: pages larger than the crawler's page-size limit are silently skipped, which is why the process "succeeds" in 0 seconds with no output. Increase the "max page size" parameter so that your pages actually get stored.

Cheers
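For illustration, here is how the Crawl Web operator from the question might look with an explicit page-size limit. This is a sketch under assumptions: the parameter key "max_page_size" and the unit (KB) are what I'd expect from the Web Mining extension of that era, but check the operator's parameter panel in your RapidMiner version before relying on them.

<!-- Same Crawl Web operator as in the question, with the page-size limit raised.
     Assumption: the key is "max_page_size" and the value is interpreted in KB. -->
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
  <parameter key="url" value="http://www.gumtree.com/property-for-sale"/>
  <list key="crawling_rules">
    <parameter key="store_with_matching_url" value=".+t.+"/>
    <parameter key="follow_link_with_matching_url" value=".+t.+"/>
  </list>
  <parameter key="output_dir" value="/home/aqil/RapidMiner/rapidminer/repository/Test"/>
  <parameter key="max_depth" value="1"/>
  <!-- Raise the per-page size limit so large pages are not silently dropped -->
  <parameter key="max_page_size" value="5000"/>
  <parameter key="user_agent" value="Firefox"/>
  <parameter key="obey_robot_exclusion" value="false"/>
  <parameter key="really_ignore_exclusion" value="true"/>
</operator>

With a limit like this in place, pages that previously exceeded the default size should be stored as txt files in the output directory like the ones from your successful crawls.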