"Web mining - crawling rules"

gingernissan Member Posts: 2 Contributor I
edited June 2019 in Help
Hi, I am new to RapidMiner. I have a site I want to crawl and extract/download pages from. The pages I am interested in share a common URL format (http://items.mywebsite.ie/for-sale/laptops/3254621). The starting URL I am using is the site search page containing the links to the relevant pages (http://items.mywebsite.ie/find/for-sale/laptops/).
My overall goal is to pull a list of, say, 20 pages in this format. The number at the end is the page ID, but it is not specific to the laptop section; it is site-wide.

I have tried several variations of the store_with_matching_url and follow_link_with_matching_url rules in an attempt to follow links containing the word laptop and then store the ones that have a 7-digit number at the end:

"http://items.mywebsitel.ie\for-sale\laptops\.+[0-9]"
'http://items.mywebsite.ie\for-sale\laptops\.+[0-9]'
(^)http://items.mywebsite.ie\for-sale\laptops\.+[0-9]($)
.+[0-9]
.+laptops.+
.+laptops.+|.+[0-9]
.[0-9][0-9][0-9][0-9][0-9][0-9][0-9]

Can anyone help me out or point me in the right direction?

Any help would be greatly appreciated. Thanks.
Answers

  • gingernissan Member Posts: 2 Contributor I
    After several more persistent hours I managed to figure it out, using these rules:
    store    .+for-sale/Laptops/.+
    follow    .+Laptops.+

    It's so obvious now, I should have got it earlier!
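
For anyone who wants to check rules like these before running a crawl, here is a minimal sketch in Python. It makes several assumptions: it uses Python's `re` module rather than RapidMiner's own regex engine, it assumes a rule must match the whole URL, and it lower-cases "Laptops" because the sample URLs in the question are lower-case (regex matching is case-sensitive, so the rule has to use whatever case the real site uses). `rule_matches`, `compiles`, and `STRICT_STORE_RULE` are hypothetical names introduced only for this illustration.

```python
import re

# Sample URLs copied from the question (the site name is a placeholder).
ITEM_URL = "http://items.mywebsite.ie/for-sale/laptops/3254621"
SEARCH_URL = "http://items.mywebsite.ie/find/for-sale/laptops/"

# Rules from the answer, lower-cased here to match the sample URLs.
FOLLOW_RULE = r".+laptops.+"
STORE_RULE = r".+for-sale/laptops/.+"

# Hypothetical stricter variant: require exactly seven digits at the end.
STRICT_STORE_RULE = r".+/for-sale/laptops/[0-9]{7}"

# One of the earlier attempts: backslashes are regex escapes, not path separators.
BACKSLASH_ATTEMPT = r"http://items.mywebsite.ie\for-sale\laptops\.+[0-9]"


def rule_matches(rule: str, url: str) -> bool:
    """Return True if the rule matches the entire URL (assumed behaviour)."""
    return re.fullmatch(rule, url) is not None


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    for name, rule in [("follow", FOLLOW_RULE),
                       ("store", STORE_RULE),
                       ("strict store", STRICT_STORE_RULE)]:
        for url in (SEARCH_URL, ITEM_URL):
            print(f"{name:12} {rule_matches(rule, url)!s:5} {url}")
    print("backslash attempt compiles:", compiles(BACKSLASH_ATTEMPT))
```

In this sketch the follow rule matches both the search page and the item page, while both store rules match only the item page. The backslash attempt is rejected by Python because `\l` is not a valid escape; the URLs themselves only contain forward slashes, so the patterns need forward slashes too.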