[SOLVED] RM5 does not store the pages according to the specified rules...
I am trying to crawl an online newspaper. I specified rules for navigating through the previous editions, and I need to store only the individual news articles (matching_url = .+deportes/8.+), not the index pages where they are listed...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
<process expanded="true" height="-20" width="-50">
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="84" y="53">
<parameter key="url" value="http://www.pagina12.com.ar"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".+principal/index.+|.+deportes/index.+|.+deportes/8.+"/>
<parameter key="store_with_matching_url" value=".+deportes/8.+"/>
</list>
<parameter key="output_dir" value="C:\Users\USR\Desktop\FILES"/>
<parameter key="extension" value="html"/>
<parameter key="max_depth" value="9999999"/>
<parameter key="domain" value="server"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
But this does not work... Please see the log below; nothing is stored. What can be wrong? Is there a bug in RM5's web crawler, or am I doing something wrong?
Apr 1, 2012 11:36:36 PM INFO: Process //NewLocalRepository/Pruebas/Crawler starts
Apr 1, 2012 11:36:36 PM INFO: Loading initial data.
Apr 1, 2012 11:36:37 PM INFO: Discarded page "http://www.pagina12.com.ar" because url does not match filter rules.
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190886-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190902-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190872-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190897-2012-04-01.html
Apr 1, 2012 11:36:39 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html" because url does not match filter rules.
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-30.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190801-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190811-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190840-2012-03-31.html
Apr 1, 2012 11:36:42 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-30.html" because url does not match filter rules.
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-29.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190725-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190718-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190743-2012-03-30.html
Apr 1, 2012 11:36:44 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-29.html" because url does not match filter rules.
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-28.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190641-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190635-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190650-2012-03-29.html
Apr 1, 2012 11:36:47 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-28.html" because url does not match filter rules.
bla, bla, bla... (the log continues in the same pattern)
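For reference, the rules themselves do seem to match the URLs in the log. Here is a quick standalone sanity check in Python (just the regexes against sample URLs from the log; RapidMiner applies its own matcher, so this is only an approximation):

import re

# The two crawling rules from the process above.
follow_rule = re.compile(r".+principal/index.+|.+deportes/index.+|.+deportes/8.+")
store_rule = re.compile(r".+deportes/8.+")

# Sample URLs taken from the log.
urls = [
    "http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html",
    "http://www.pagina12.com.ar/diario/deportes/index-2012-03-31.html",
    "http://www.pagina12.com.ar/diario/deportes/8-190886-2012-04-01.html",
]

for url in urls:
    print(url)
    print("  follow:", bool(follow_rule.match(url)), "| store:", bool(store_rule.match(url)))

The article URL matches the store rule, so the pattern itself does not seem to be the problem.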
Thank you in advance.
Leonardo Der Jachadurian Gorojans
Answers
Try to reduce the max depth and/or adjust your FOLLOW rules. The operator first descends, and only on its way back up from the recursion does it store the pages. You seem to recurse (almost) indefinitely deep, which could indicate an error in your FOLLOW rule. Reducing the max depth, however, can also help.
Best, Marius
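To illustrate the point, here is a minimal sketch (not RapidMiner's actual implementation; links_of, follow and store are hypothetical helpers) of a depth-first crawler that stores a page only after the recursion below it has returned:

# Depth-first crawl: descend along all matching links first, store on the
# way back up. With circular link paths and max_depth=9999999, the
# recursion (practically) never returns, so the store step is never reached.
def crawl(url, links_of, follow, store, stored, depth=0, max_depth=9999999):
    if depth > max_depth:
        return
    for link in links_of(url):      # links_of(): hypothetical helper returning outgoing links
        if follow(link):
            crawl(link, links_of, follow, store, stored, depth + 1, max_depth)
    if store(url):                  # storing happens only after descending
        stored.append(url)

With a small max_depth the recursion bottoms out quickly and pages start being stored, which matches the log above: plenty of "Following link" entries, but no stores.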
This online newspaper has a lot of circular link paths. I have rearranged the RM crawl process to make the date navigation iterative with the Loop operator, and from each day's index I use the web crawler to get the individual news pages. See here, please... Thank you.
Best regards, Leonardo
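For comparison, here is a rough standalone sketch in Python of that iterative strategy (not the RapidMiner process itself; the index URL pattern is taken from the log, but the href regex is an assumption about the index pages' HTML):

import re
import urllib.request
from datetime import date, timedelta

# Dated index pages as seen in the log; articles live under /diario/deportes/8-...
INDEX = "http://www.pagina12.com.ar/diario/deportes/index-{:%Y-%m-%d}.html"
ARTICLE_HREF = re.compile(r'href="(/diario/deportes/8-[^"]+\.html)"')

day = date(2012, 4, 1)
for _ in range(3):                      # e.g. the three most recent editions
    html = urllib.request.urlopen(INDEX.format(day)).read().decode("latin-1", "replace")
    for path in sorted(set(ARTICLE_HREF.findall(html))):
        print("would store:", "http://www.pagina12.com.ar" + path)   # fetch & save for real
    day -= timedelta(days=1)

Looping over the dates directly avoids the deep recursion entirely: each iteration touches one index page and its articles, nothing more.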
What's the problem with that process? Without looking at it in detail, I saw that it grabs some pages...
If it does not do what you expected, did you try some of the hints in my first post? With those I got your first process working.
Best,
Marius
I followed your recommendations, adjusting the crawling rules (I made them more specific to avoid unwanted paths) and trying several depths (from 0 to 9, then 10, 20, 50, 99, 999, ...), and it does not work as I need.
What I need is to crawl every page about "Deportes", from today back several years (say 5 years), in this online newspaper (Pagina12.com).
With high depths (>=99), I reached a maximum of 1062 stored pages, and after that the process stops without errors. With a depth of 9, I only get 96 pages stored...
The solution that I posted was able to obtain all the pages.
Where can I get more detailed documentation about the RM web crawler (or about the library that this operator uses)?
Thank you.