
"Web crawling -overcome memory limit, split urls into susamples and then combine"

In777 Member Posts: 2 Learner III
edited June 2019 in Help
Hello,

I retrieve data from several web pages (>30,000) with the "Get Pages" operator. I have imported all my URLs into the repository from an Excel file. Then I process the information with regex (I extract several categories) and write the category information to Excel, in a separate row for each URL. My process works fine with a small number of URLs, but my computer does not have enough memory to process all the web pages at once.

I would like to split the URLs into pieces of about 2,000 each, run the process on each piece separately, and join the Excel files together at the end. I looked at the sampling operators, but most of them produce a random sample, and I want to keep the order in which the URLs are crawled (if possible). I think I need to write a loop, but I cannot figure out where to start. For example, I do not know which loop operator to use, or how to make it write several Excel files or sheets with different names (1-x). Could anybody help me with that?

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    There's an operator named Loop Batches. It should do what you need for writing out the Excel files, and after the loop you'll be able to loop over those Excel files and combine them (see the first sketch below).

    May I suggest that rather than looping to write into an Excel file, you consider either a database (where you can append the new data) or Write CSV, where you can tick "Append to File". This means that if your process stops at any point, you can always pick up where it left off without restarting from the beginning of the file (see the second sketch below).
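    A rough Python sketch of the batch idea, outside RapidMiner (Loop Batches does this natively inside a process): split the URL list into fixed-size chunks, keep the original order, and write one numbered file per chunk. The file names urls.txt and batch_N.csv, the batch size, and the category regex are all assumptions for illustration, not part of the original question.

        import csv
        import re
        import urllib.request

        BATCH_SIZE = 2000  # assumption: pick a size that fits in memory

        def extract_categories(html):
            # placeholder pattern; the real regexes depend on the page layout
            return re.findall(r'<span class="category">(.*?)</span>', html)

        # urls.txt stands in for the repository entry holding the URL list
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]

        # walk the list in its original order, one fixed-size batch at a time
        for batch_no, start in enumerate(range(0, len(urls), BATCH_SIZE), start=1):
            rows = []
            for url in urls[start:start + BATCH_SIZE]:
                html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
                rows.append([url] + extract_categories(html))
            # one numbered output file per batch: batch_1.csv, batch_2.csv, ...
            with open(f"batch_{batch_no}.csv", "w", newline="") as out:
                csv.writer(out).writerows(rows)

    Because the batch files are numbered in crawl order, concatenating batch_1.csv, batch_2.csv, ... at the end preserves the original URL order.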
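    And a sketch of the append-and-resume idea from the second suggestion: write every finished row to a single CSV opened in append mode, and on restart skip the URLs that are already in the file. Again, urls.txt, results.csv, and the regex are illustrative assumptions.

        import csv
        import os
        import re
        import urllib.request

        OUTPUT = "results.csv"  # assumption: one cumulative results file

        # collect the URLs already written, so a restart skips them
        done = set()
        if os.path.exists(OUTPUT):
            with open(OUTPUT, newline="") as f:
                done = {row[0] for row in csv.reader(f) if row}

        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]

        # append mode: rows from earlier runs stay in place
        with open(OUTPUT, "a", newline="") as out:
            writer = csv.writer(out)
            for url in urls:
                if url in done:
                    continue  # processed in an earlier run
                html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
                categories = re.findall(r'<span class="category">(.*?)</span>', html)
                writer.writerow([url] + categories)
                out.flush()  # finished rows survive an interruption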