"Text Mining-Crawler problem"
Hi everyone,
I am facing a problem while using the Crawler. I tried the following code:
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\Documents and Settings\284561\Desktop\rapid\logfile.log"/>
<parameter key="resultfile" value="C:\Documents and Settings\284561\Desktop\rapid\result.res"/>
<operator name="Crawler" class="Crawler">
<list key="crawling_rules">
<parameter key="follow_url" value="spreeblick"/>
<parameter key="visit_content" value="google"/>
</list>
<parameter key="output_dir" value="C:\Documents and Settings\284561\Desktop\rapid"/>
<parameter key="url" value="http://www.spreeblick.com/"/>
</operator>
</operator>
When I run this, I get a message that the process was successful, but I cannot see the HTML pages in the specified output directory.
Can anyone tell me what the problem is? I am also attaching my log file.
P Dec 15, 2008 2:01:44 PM: Logging: log file is 'logfile.log'...
P Dec 15, 2008 2:01:44 PM: Initialising process setup
P Dec 15, 2008 2:01:44 PM: Checking properties...
P Dec 15, 2008 2:01:44 PM: Properties are ok.
P Dec 15, 2008 2:01:44 PM: Checking process setup...
P Dec 15, 2008 2:01:44 PM: Inner operators are ok.
P Dec 15, 2008 2:01:44 PM: Checking i/o classes...
P Dec 15, 2008 2:01:44 PM: i/o classes are ok. Process output: ExampleSet.
P Dec 15, 2008 2:01:44 PM: Process ok.
P Dec 15, 2008 2:01:44 PM: Process initialised
P Dec 15, 2008 2:01:44 PM: [NOTE] Process starts
P Dec 15, 2008 2:01:44 PM: Process:
Root[1] (Process)
+- Crawler[1] (Crawler)
Last message repeated 1 times.
P Dec 15, 2008 2:02:05 PM: Produced output:
IOContainer (2 objects):
SimpleExampleSet:
0 examples,
2 regular attributes,
no special attributes
(created by Crawler)
com.rapidminer.operator.crawler.LinkMatrix@13ddd13
(created by Crawler)
P Dec 15, 2008 2:02:05 PM: [NOTE] Process finished successfully after 21 seconds
Answers
Your crawling rules probably forbid storing any of the pages found. For the meaning of the individual rule parameters, see http://nemoz.org/joomla/content/view/64/53/lang,de/
Greetings,
Sebastian
I tried the crawler on an intranet site, and it works fine. But when I try to crawl internet sites, it gives me problems.
The user agent I am using is rapid-miner-crawler. For accessing intranet sites, do I have to use any other user agent?
Thank you for your quick reply.
Greetings,
Siju Sony Mathew
Perhaps they forbid this type of user agent on their site, or even exclude crawlers in their robots.txt.
Greetings,
Sebastian
Is there any other user agent with which the crawler can access the web pages?
Greetings,
Siju
The parameter user_agent of the crawler specifies the string used to identify the client to the HTTP server. You can put in arbitrary values, for example the user agent strings of Internet Explorer, Firefox, or something else. If it is your own web page, you could even turn off "obey_robot_exclusion", causing the crawler to ignore bans in robots.txt. But do this only if it is your own page!
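As a rough sketch, the two parameters could be added to the Crawler operator from the first post roughly like this. The parameter keys user_agent and obey_robot_exclusion are the ones named above; the user agent string itself and the value "false" are only assumptions for illustration, so check the operator's parameter list in your RapidMiner version for the exact accepted values.

```xml
<operator name="Crawler" class="Crawler">
    <list key="crawling_rules">
        <parameter key="follow_url" value="spreeblick"/>
        <parameter key="visit_content" value="google"/>
    </list>
    <parameter key="output_dir" value="C:\Documents and Settings\284561\Desktop\rapid"/>
    <parameter key="url" value="http://www.spreeblick.com/"/>
    <!-- Assumption: an arbitrary browser-like string; any value can be entered here -->
    <parameter key="user_agent" value="Mozilla/5.0 (compatible; example-browser)"/>
    <!-- Assumption: "false" disables the robots.txt check; use only on your own site -->
    <parameter key="obey_robot_exclusion" value="false"/>
</operator>
```

The crawling rules are left unchanged here, so this only affects how the crawler identifies itself and whether it honors robots.txt, not which pages get stored.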
Greetings,
Sebastian