"First Steps in Webmining"
So I decided to dig a bit deeper into RapidMiner and set myself a first challenge:
I want the crawler to fetch every posting of a blog that mentions a certain word.
At first I started with the wizard, but it seems to expect an already existing database/file to work with.
So what I did was take the "naked" Root Process, add the crawler operator,
and configure the rules to something like this:
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomlogfile.log"/>
<parameter key="resultfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomresultfile.res"/>
<operator name="Crawler" class="Crawler">
<list key="crawling_rules">
<parameter key="follow_url" value="spreeblick"/>
<parameter key="visit_content" value="google"/>
</list>
<parameter key="obey_robot_exclusion" value="false"/>
<parameter key="output_dir" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\nsv"/>
<parameter key="url" value="http://www.spreeblick.com/"/>
</operator>
</operator>
So the crawler should go to spreeblick.com, follow only URLs that contain the string "spreeblick",
and save only those pages that contain the string "google" somewhere in their content.
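In case it helps to see those rule semantics spelled out: here is a minimal Java sketch of what the two rules above express. This is not RapidMiner's actual implementation, and the class and method names are my own; it just shows the logic of "follow a link only if its URL contains 'spreeblick', store a fetched page only if its body contains 'google'":

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlRuleSketch {

    // follow_url rule: only follow links whose URL contains the pattern
    static boolean shouldFollow(String url) {
        return url.contains("spreeblick");
    }

    // visit_content rule: only store pages whose content contains the pattern
    static boolean shouldStore(String pageContent) {
        return pageContent.contains("google");
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String url = "http://www.spreeblick.com/";
        if (!shouldFollow(url)) {
            return; // rule 1: skip URLs that do not match
        }
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());
        if (shouldStore(response.body())) {
            System.out.println("would store " + url); // rule 2 matched
        }
    }
}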
Now, the funny thing is, it even starts crawling, but ONLY if "obey_robot_exclusion" is
active. If I deactivate it, I get a "Process failed, RuntimeException caught: JOptionPane: parentComponent does
not have a valid parent." error.
Just to make sure I'm on the right track so far... what am I doing wrong to get this strange robot exclusion error?
Answers
You can get the fixed version via CVS (the bug was in the text plugin, formerly known as "wvtool", hence the module name) and the bugfix will of course also be part of the next release.
Cheers,
Ingo
That's why textinput even provides a tutorial for using HTTrack. ;D
Did you know "webharvest" ( http://web-harvest.sourceforge.net/ , I do not remember if I have already talked of that )? It is a kind of high level scripting language that looks like XML, and aimed at specifying which type of harvesting task you want to perform. Assuming that you could call a WebHarvest script from RapidMiner, you could do exactly what you want...
@Ingo & Steffen:
Might a "scripting box" for WebHarvest be an interesting feature request?
Cheers,
Jean-Charles.
Hi,
I have a question about how to get data from the pages behind hyperlinks. Given the hyperlinks that appear on every web page, how do I extract the data from the pages they point to?
Did you try this example? http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480
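In case that link goes stale, the pattern it demonstrates is roughly this: fetch the start page, pull out its hyperlinks, then fetch each page behind them. A crude Java sketch (the start URL is a placeholder, and a real crawler would use an HTML parser rather than a regex):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FollowLinks {
    // crude extraction of absolute links; a real crawler should parse the HTML
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String startUrl = "http://www.example.com/"; // placeholder start page

        // Step 1: fetch the start page
        String html = client.send(
                HttpRequest.newBuilder(URI.create(startUrl)).build(),
                HttpResponse.BodyHandlers.ofString()).body();

        // Step 2: follow every extracted link and fetch the page behind it
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            String page = client.send(
                    HttpRequest.newBuilder(URI.create(link)).build(),
                    HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(link + " -> " + page.length() + " characters");
        }
    }
}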