Web mining: need help with web crawling
Hello community members,
I am looking for a way to do web crawling. I have read in the forums that HTTPS websites cannot easily be crawled using the "Web Crawl" operator; instead you have to use a combination of "Get Pages" and "Loop", as described by Telconstar, but I haven't found anything more about this approach yet.
I will briefly explain what I want to crawl: the property listings displayed on a German real-estate website (immowelt.de).
Typically, the search URL encodes the location, the minimum and maximum number of rooms, buy or rent, and the sort order:
immowelt.de/liste/muenchen/wohnungen/kaufen?roomi=2&rooma=2&sort=relevanz
The matching properties are then listed; each listing link consists of the constant segment "expose" plus the ID of the offer, see below:
immowelt.de/projekte/expose/k2rb332
With the "Web Crawl" operator this would be easy: you would simply pass "expose" as a filter parameter for the crawl.
But how does this work with "Get Pages" and "Loop"? The IDs are not sequential, so I cannot simply count them up. I would be very grateful if you could help me.
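Since every listing link contains the fixed "expose" segment, the pages retrieved with "Get Pages" can be filtered for exactly those links. RapidMiner does this with operators, but the filtering logic can be illustrated with a small Python sketch (the HTML fragment and the helper name are illustrative, not taken from the site):

```python
import re

def extract_expose_links(html: str) -> list[str]:
    """Collect all hrefs whose path contains the fixed 'expose' segment."""
    return re.findall(r'href="(/(?:projekte/)?expose/[^"]+)"', html)

# Illustrative HTML fragment shaped like a result page
sample = '''
<a href="/projekte/expose/k2rb332">Wohnung A</a>
<a href="/expose/abc123">Wohnung B</a>
<a href="/impressum">Impressum</a>
'''
print(extract_expose_links(sample))
```

The same idea applies inside RapidMiner: a crawl/filter rule that keeps only URLs matching "expose" separates listing pages from navigation pages.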
I wish you and your families a nice weekend
Regards
TB161
Answers
Crawl the first page and, next to your regular content, also extract the indicator for the total number of results.
For your example this would be
8 Objekte zum Kauf (insgesamt 141 Wohneinheiten im Projekt) — "8 objects for sale (141 residential units in the project overall)"
So we know there are 8 results in total, and the site shows 6 per page, so we can create a macro that stores our page count (the ceiling of 8 divided by 6 gives 2 pages).
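The macro calculation above is just a ceiling division. A minimal sketch of the arithmetic (in Python for illustration; in RapidMiner this would live in a macro expression):

```python
import math

def page_count(total_results: int, per_page: int) -> int:
    """Number of result pages = ceiling(total / per_page)."""
    return math.ceil(total_results / per_page)

print(page_count(8, 6))   # 8 results at 6 per page -> 2 pages
```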
Next you need to do some reverse engineering of the pagination to understand how the website moves from one page to another. If you are lucky it is something like mysite.com/page?nextpage=2, so you create a loop where you crawl the page but increment the page parameter each time, like
mysite.com/page?nextpage=3
mysite.com/page?nextpage=4
...
until the last page you need.
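The loop described above just generates one URL per page by incrementing the page parameter. A small sketch of that URL generation (the base URL and the `nextpage` parameter name are the hypothetical example from above, not immowelt's real scheme):

```python
def paged_urls(base: str, pages: int) -> list[str]:
    """Build one URL per result page by incrementing the page parameter."""
    return [f"{base}?nextpage={n}" for n in range(1, pages + 1)]

for url in paged_urls("https://mysite.com/page", 3):
    print(url)
```

In RapidMiner the equivalent is a "Loop" operator whose iteration macro is inserted into the URL of "Get Pages".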
Now, your page seems to load dynamically (not moving to a new page, just appending to the previous load), so it is not straightforward in this case. You will probably need to look at the page-load sequence (using the browser's developer tools, Network tab) to see which request is made behind the scenes.
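Once the Network tab reveals the background request, you can usually call that endpoint directly with an incrementing page parameter. The endpoint path and parameter names below are hypothetical placeholders; the real ones must be read from the DevTools Network tab:

```python
from urllib.parse import urlencode

def xhr_url(base: str, page: int) -> str:
    """Build a paged URL for the (hypothetical) background endpoint."""
    params = {"page": page, "sort": "relevanz"}
    return f"{base}?{urlencode(params)}"

# Fetching would then look like this (requires the `requests` package):
#   import requests
#   data = requests.get(xhr_url("https://example.invalid/api/search", 2)).json()
print(xhr_url("https://example.invalid/api/search", 2))
```

Such endpoints often return JSON directly, which is easier to parse than the rendered HTML.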
Hope this gets you started
Thank you for your suggestions. I tried it over the last few days, but unfortunately my experience is limited.
Therefore I will use ParseHub for the crawling; the rest I will do in Redmine.
Thanks for your support!
Regards, TB
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Good idea, this could fly... but isn't it the case that the HTML contains only the "first" page? When the results span several pages, I don't know how to crawl them.
Regards
TB