Is it possible to extract data from a list of URLs instead of first saving them?

LaserLaser Member Posts: 1 Learner III
edited November 2018 in Help
Hi,

I've recently discovered RapidMiner, and I'm very excited about it. However, I'm still unsure whether the program can help me with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'crawl web' operator in RapidMiner). I've seen the following tutorial from Neil Mcguigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. I was excited when I read that the operator 'process documents from web' doesn't need to store the HTML pages, but was disappointed to find that it still needs to do the crawling itself. It also lacks critical features, so I'm unable to use it for my purposes. Is there a way to make it just read the URLs and scrape the XPath matches from each of them?
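
To make the goal concrete, the workflow described above amounts to something like the sketch below. It is written in Python with only the standard library, not RapidMiner, so it illustrates the idea rather than answering how to wire up the operators. The names `fetch`, `scrape`, and `fetch_page` are made up for this example. It also assumes the pages are well-formed XHTML, because `xml.etree.ElementTree` only parses valid XML and only supports a subset of XPath; real-world HTML would likely need a more lenient third-party parser.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List


def fetch(url: str) -> str:
    """Download one page into memory; nothing is written to disk."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls: Iterable[str], xpath: str,
           fetch_page: Callable[[str], str] = fetch) -> Dict[str, List[str]]:
    """Return the text of every XPath match, keyed by URL.

    Assumption: each page is well-formed XML/XHTML, and `xpath` uses the
    limited XPath subset that ElementTree understands (e.g. ".//h1").
    """
    results: Dict[str, List[str]] = {}
    for url in urls:
        try:
            root = ET.fromstring(fetch_page(url))
        except (OSError, ET.ParseError):
            # Page unreachable or not well-formed: record no matches.
            results[url] = []
            continue
        results[url] = ["".join(node.itertext()).strip()
                        for node in root.findall(xpath)]
    return results
```

With a URL list exported one per line to a text file (say a hypothetical `urls.txt`), the lines could be read into a list and passed as `scrape(url_list, ".//h1")`. Each page is handled in memory and discarded once its matches are collected, so nothing is stored per page on disk.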

I've been reading the manual and several pages on this forum, but couldn't find an answer. I have also watched a fair number of tutorials, and I'm still unable to figure it out. I could share the process I have right now, but I've just been collecting the operators that look useful to me and have been unable to connect them successfully, so it probably won't make much sense. Any help is much appreciated. Thanks in advance.

~ George