
Webcrawler Doubt

newbierapid Member Posts: 6 Contributor II
edited November 2018 in Help
Hi All,

I am using RM 5.1 and am currently experimenting with web mining. My objective is to crawl a web page and display the results according to the crawling rules. After applying the crawling rules, I am not able to see any output.

I appreciate any help; thanks in advance.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="503" width="604">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="122" y="119">
        <parameter key="url" value="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&amp;Sect2=HITOFF&amp;u=/netahtml/PTO/search-adv.htm&amp;r=0&amp;p=1&amp;f=S&amp;l=50&amp;Query=apple&amp;d=PTXT"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*(Apple)"/>
          <parameter key="store_with_matching_content" value=".*(Apple"/>
          <parameter key="follow_link_with_matching_text" value=".*(Apple"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="max_pages" value="5"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


Thanks

Answers

  • colo Member Posts: 236 Maven
    Hi,

    It seems there are some closing brackets missing for the last two rules.

    There is one special thing to consider when using "store_with_matching_content": if you want the dot to match all symbols including line breaks, you have to activate the dot-all mode. This is possible by placing "(?s)" at the beginning of your expression. But this will make crawling slow, since whole webpages have to be scanned (see http://rapid-i.com/rapidforum/index.php/topic,2102.0.html).
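
    As a quick illustration of what "(?s)" changes (just a sketch using plain java.util.regex, which is the syntax RapidMiner's regular expressions follow since it is Java-based; the page content below is made up):

    import java.util.regex.Pattern;

    public class DotAllDemo {
        public static void main(String[] args) {
            // made-up page content that spans two lines
            String content = "<html>\n<body>Apple patents</body></html>";

            // without (?s) the dot does not match the line break,
            // so the expression cannot cover the whole document
            System.out.println(Pattern.matches(".*Apple.*", content));     // false

            // with (?s) (dot-all mode) the dot also matches "\n"
            System.out.println(Pattern.matches("(?s).*Apple.*", content)); // true
        }
    }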

    Regards
    Matthias
  • newbierapidnewbierapid Member Posts: 6 Contributor II
    Hi Matthias,

    I have tried it the way you explained, but I still couldn't find the solution. Please find the XML code below.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="521" width="622">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="210">
            <parameter key="url" value="http://www.google.com/search?q=apple&amp;btnG=Search+Patents&amp;tbm=pts&amp;tbo=1&amp;hl=en"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_text" value="(?s).*apple.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    Thanks
  • colo Member Posts: 236 Maven
    Hi,

    You're right, this is simply not working. I also can't obtain any pages from either of the URLs you tried (even without crawling rules, which means that all links should be followed). I tried a smaller webpage instead and that works. Maybe those big pages block the crawler somehow?
    Further investigation of the returned messages will certainly be required, which means working with the source code. Or maybe I am also missing something necessary to get this working... Sorry.
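
    A rough way to check that suspicion (just a sketch with plain java.net; the class and helper names below are only illustrative) is to compare the HTTP status codes the site returns for different User-Agent headers, and to look at what its robots.txt allows:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CrawlCheck {
        // returns the HTTP status code the server answers with
        // when the request carries the given User-Agent header
        static int statusFor(String url, String userAgent) throws IOException {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestProperty("User-Agent", userAgent);
            con.setInstanceFollowRedirects(false);
            int code = con.getResponseCode();
            con.disconnect();
            return code;
        }

        public static void main(String[] args) throws IOException {
            String url = "http://www.google.com/search?q=apple";
            // default-looking client vs. a browser-like User-Agent
            System.out.println(statusFor(url, "Java"));
            System.out.println(statusFor(url, "Mozilla/5.0"));
            // robots.txt tells you what the site wants crawlers to skip
            System.out.println(statusFor("http://www.google.com/robots.txt", "Java"));
        }
    }

    If the two status codes differ, or robots.txt disallows the path, the site is most likely rejecting automated clients rather than the crawler misbehaving.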

    Regards
    Matthias
  • newbierapid Member Posts: 6 Contributor II
    Thanks Matthias,