Webcrawler download images?
mrbigglesworth
Member Posts: 1 Learner III
Hi,
I haven't posted on this forum before, but I've been using RapidMiner recently to crawl websites and run some analysis on them. In certain cases, I would like RapidMiner's web crawler to save the full page rather than just the HTML. Specifically, the page may contain some JPG files, and I'd like to archive those. Is there an easy way to do this (other than writing a custom Groovy script)?
Thanks!
PS - Thank you very much for the great support and the great program. RapidMiner keeps getting better.
Answers
As far as I know, the embedded WebSphinx crawler doesn't support storing additional media. But it is better to check the documentation to be sure: http://www.cs.cmu.edu/~rcm/websphinx/
There was a discussion some time ago about replacing this ancient crawler. I would also be glad if a better crawler were available in RapidMiner, but since the existing one works, other things certainly have priority. Maybe I will integrate another crawler after I (hopefully) finish my thesis this month, if I start missing the development of RapidMiner add-ons.
Regards
Matthias
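For anyone who ends up post-processing the crawled HTML outside RapidMiner, here is a minimal sketch of the first step: finding the JPG links in a saved page. It is not part of RapidMiner or WebSphinx; it uses only the Python standard library, and the names `ImageLinkParser` and `extract_jpg_urls` are made up for this illustration. It assumes the images are referenced through plain `<img src="...">` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_jpg_urls(html, base_url):
    """Return absolute URLs of .jpg/.jpeg images, in page order, without duplicates."""
    parser = ImageLinkParser()
    parser.feed(html)
    urls = []
    for src in parser.sources:
        absolute = urljoin(base_url, src)  # resolve relative links against the page URL
        path = urlparse(absolute).path.lower()  # ignore query strings when checking the extension
        if path.endswith((".jpg", ".jpeg")) and absolute not in urls:
            urls.append(absolute)
    return urls


page = '<html><body><img src="a.jpg"><img src="/img/b.JPEG?x=1"/><img src="c.png"></body></html>'
found = extract_jpg_urls(page, "http://example.com/dir/page.html")
print(found)
```

Each URL in the returned list could then be fetched with something like `urllib.request.urlretrieve(url, filename)`. Images loaded via CSS or JavaScript would not be found this way.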