"WEB crawler rules"
Hi!
I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL, but I'm completely new to RapidMiner.
So I've started to use the Web Crawler operator.
I'm using it on a Slovenian real estate webpage, and I'm having trouble setting the web crawler rules.
I know that two rules are important: which URLs to follow and which to store.
I would like to store URLs of the form http://www.realestate-slovenia.info/nepremicnine.html?id=something
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280
What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+
Any help would be appreciated!
U.
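A quick way to sanity-check a candidate rule outside RapidMiner is to test it against a known URL. A minimal sketch in Python (an assumption: the operator appears to apply Java-style regexes to the complete URL, and these simple patterns behave the same in Python's re module):

import re

# The follow rule tried above, tested against the ad URL from the post.
follow_rule = r".+pg.+|.+id.+"
ad_url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"
print(bool(re.fullmatch(follow_rule, ad_url)))  # True: the pattern itself matches

Since the pattern does match the ad URL, a failing crawl usually means the follow rule does not also match the listing pages that link to the ads, or the crawl stops for another reason (depth, page size).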
Answers
On a quick check I got some pages with the following settings:
url: http://www.realestate-slovenia.info/
both rules: .+id.+
And I also increased the max page size to 10000.
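A note on how permissive that rule is: .+id.+ full-matches any URL that merely contains "id" somewhere, not only the ad pages. A minimal sketch in Python (the second URL is hypothetical, just for illustration):

import re

rule = r".+id.+"  # the rule suggested above, for both follow and store

urls = [
    "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280",  # an ad page (wanted)
    "http://www.realestate-slovenia.info/video.html",                    # hypothetical page containing "id"
]
for u in urls:
    print(u, "->", bool(re.fullmatch(rule, u)))
# Both print True, so the crawl can store pages outside the target set.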
As always I have to ask this: did you check that the site's policy/copyright notice allows you to machine-crawl that page?
Best regards,
Marius
The web page allows robots.
Your example stores only the real estate ads on the first page; the web crawler doesn't go to the second, third, ... page.
Thanks for helping.
I put the web crawler problem aside for a while. Today I started to deal with it again. I still have a problem with the crawling rules; all the other web crawler parameters are clear.
This is my web crawler process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
<parameter key="url" value="http://www.realestate-slovenia.info/nepremicnine.html?q=sale"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&]pg=.+ | id=.+)"/>
<parameter key="store_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?id=.+"/>
</list>
<parameter key="output_dir" value="C:\RapidMiner\RealEstate"/>
<parameter key="extension" value="html"/>
<parameter key="max_depth" value="4"/>
<parameter key="domain" value="server"/>
<parameter key="max_page_size" value="10000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
As you can see, I try to follow three types of URLs, for example:
http://www.realestate-slovenia.info/nepremicnine.html?q=sale
http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
http://www.realestate-slovenia.info/nepremicnine.html?id=5744923
And I want to store only one type of URL:
http://www.realestate-slovenia.info/nepremicnine.html?id=5469846
So for the first task my rule is:
http://www.realestate-slovenia.info/nepremicnine.html?(q=sale | q=sale&pg=.+ | id=.+)
For the second task the rule is:
http://www.realestate-slovenia.info/nepremicnine.html?id=.+
The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
.+pg.+ | .+id.+ for the first task and .+id.+ for the second, but the latter returns many pages that are not my focus.
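One likely culprit: ? and . are regular-expression metacharacters, so html? matches "htm" plus an optional "l" rather than the literal "?" in the URL, and the spaces around | become literal parts of the alternatives. A sketch of the effect, plus an escaped variant (the escaped rule is an assumption of mine, not something tested inside RapidMiner):

import re

urls = [
    "http://www.realestate-slovenia.info/nepremicnine.html?q=sale",
    "http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6",
    "http://www.realestate-slovenia.info/nepremicnine.html?id=5744923",
]

# Follow rule as posted: the unescaped '?' and the stray spaces
# prevent any of the three URLs from matching in full.
posted = r"http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&]pg=.+ | id=.+)"

# Escaped variant: metacharacters escaped, no spaces in the alternation.
escaped = r"http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale&pg=.+|id=.+)"

for u in urls:
    print(bool(re.fullmatch(posted, u)), bool(re.fullmatch(escaped, u)), u)
# posted -> False for all three; escaped -> True for all three

The store rule would need the same escaping: http://www\.realestate-slovenia\.info/nepremicnine\.html\?id=.+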
I would really like this process to work because the gathered data are the basis for my article.
Thanks.