
"Scrape a website and download hyperlinked pdf files"

gary_molloy Member Posts: 4 Learner II
edited June 2019 in Help

I can scrape in Python, but how do I download and store hyperlinked PDF or other files in their native format using RapidMiner?


Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Is the "Open File" operator not doing what you want?  It allows you to get files from any URL or file path and have them as a file object, which can then be stored.  If you have multiple files then you can use macros and put this in a loop.

    If you want to scrape actual web pages, then use "Get Page" or "Get Pages" instead.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @gary_molloy - if you use the "Crawl Web" operator (Web Mining extension), there is an option to "write pages to disk".  This will save the PDFs as normal files.  I have done this many times.


    Scott
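
For anyone who, like the original poster, already scrapes in Python, here is a minimal sketch of the same idea outside RapidMiner: fetch a page, collect links that end in .pdf, and write each file to disk in its native binary format. The start URL, output folder, and the use of the requests and beautifulsoup4 packages are illustrative assumptions, not anything confirmed in the answers above.

    # Sketch only: download every PDF hyperlinked from one page.
    # Assumes `pip install requests beautifulsoup4`; START_URL is a placeholder.
    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://example.com/reports"  # placeholder page containing PDF links
    OUT_DIR = "pdfs"

    os.makedirs(OUT_DIR, exist_ok=True)

    page = requests.get(START_URL, timeout=30)
    page.raise_for_status()

    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if not href.lower().endswith(".pdf"):
            continue
        pdf_url = urljoin(START_URL, href)      # resolve relative links
        target = os.path.join(OUT_DIR, os.path.basename(href))
        resp = requests.get(pdf_url, timeout=60)
        resp.raise_for_status()
        with open(target, "wb") as f:           # write bytes, keeping the native format
            f.write(resp.content)
        print("saved", target)

The for loop here plays the same role as looping with macros over the "Open File" operator in RapidMiner, as described in the first answer.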
