Advanced web crawling question
I am working on a project that requires me to web crawl macys.com. What I would like to do is bring back reviews from this website, but have it group the reviews based on similar topics. Is this something I can do using web crawling along with some possible grouping functionality? Just wondering if RapidMiner can give me that level of detail.
Thank you
Answers
Yes. You'll need the Web and Text Mining extensions, and you will probably scrape those reviews using XPath.
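To make the XPath idea concrete outside RapidMiner, here is a minimal Python sketch of pulling review text from a product page. The URL and the class name in the XPath are assumptions for illustration; macys.com's real markup is dynamic and will differ, so inspect the page source first.

```python
import requests
from lxml import html

# Hypothetical product page URL -- substitute a real one.
url = "https://www.macys.com/shop/product/example-product"
page = html.fromstring(requests.get(url, timeout=30).text)

# Grab the text of every review block via XPath. The class name here is
# an assumption; since the real site is dynamic, you may need the
# rendered page source rather than the raw HTML.
reviews = [r.text_content().strip()
           for r in page.xpath('//div[contains(@class, "review-text")]')]
print(len(reviews), "reviews scraped")
```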
Perfectly possible, but since the website is very dynamic and built on very dirty code, it is not very straightforward. Going through the whole process might take a bit too much time, but this should get you started:
What does this do?
-> It starts with one single URL and cleans up the dirty code so that only the relevant page blocks remain. Typically you would use the HTML to XML converter, but given the quality of the source code this is not a good option here. The cleanup process works fine for this specific page; you may need to add more cleanup steps if you test other pages.
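As a rough illustration of that cleanup step (not the actual sample process), stripping the noise down to the relevant block might look like this in Python with lxml; the XPath for the product listing is an assumption:

```python
from lxml import html, etree

raw = open("product_listing.html").read()   # the page fetched earlier
tree = html.fromstring(raw)

# Strip the usual noise; which tags count as noise is page-specific,
# these are just the common suspects.
etree.strip_elements(tree, "script", "style", "noscript", with_tail=False)

# Keep only the block that holds the product listing (XPath is assumed).
blocks = tree.xpath('//ul[contains(@class, "items")]')
cleaned = etree.tostring(blocks[0], pretty_print=True) if blocks else b""
```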
-> Next, a simple XSLT is applied to get all the products that have reviews, and the document data is converted to an example set. In the sample it will contain the product name, product page, review count and rating, but of course you can add whatever you want using the same logic.
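For readers who want to see what such an XSLT could look like, here is a stand-in in the same spirit; the element names (`product`, `reviewCount`, `name`, `url`, `rating`) are assumptions, not the real structure of the cleaned page:

```python
from lxml import etree

# Stand-in XSLT: keep every product node that reports at least one
# review and emit one row per product (element names are assumed).
xslt = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <products>
      <xsl:for-each select="//product[reviewCount &gt; 0]">
        <row>
          <name><xsl:value-of select="name"/></name>
          <page><xsl:value-of select="url"/></page>
          <reviews><xsl:value-of select="reviewCount"/></reviews>
          <rating><xsl:value-of select="rating"/></rating>
        </row>
      </xsl:for-each>
    </products>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
print(transform(etree.parse("cleaned_page.xml")))
```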
From here onwards it is basically a matter of looping over the examples. For each URL you apply the same logic: open the page, clean out all the rubbish so you are only left with the reviews, and so on. The example here focuses on one single collection page, but you can also retrieve the number of pages for a given category using XPath, use that in a loop again, and travel through all the pages that way.
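The page-count trick could look like this in Python; both XPaths and the `Pageindex` URL scheme are guesses for illustration only:

```python
import requests
from lxml import html

base = "https://www.macys.com/shop/example-category"   # assumed URL

# Read the page count once from the pager, then walk every collection page.
first = html.fromstring(requests.get(base, timeout=30).text)
nums = [int(p) for p in
        first.xpath('//li[contains(@class, "pagination")]//a/text()')
        if p.strip().isdigit()]
last_page = max(nums) if nums else 1

product_urls = []
for n in range(1, last_page + 1):
    tree = html.fromstring(
        requests.get(f"{base}/Pageindex/{n}", timeout=30).text)
    product_urls += tree.xpath(
        '//a[contains(@class, "productThumbnail")]/@href')
```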
So your final flow could look like this:
-> create a CSV with starting pages (your categories)
-> loop through these one by one, get the number of pages for each category, and use that as a loop variable to get all the product collection pages
-> for every product page that has reviews, get the final page (the actual single product page)
-> get the reviews, store them, and play around (a topic-grouping sketch follows below)
-> back to start
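And since the original question was about grouping the reviews by similar topics: once the review texts are stored, a standard text-clustering pass covers that part (in RapidMiner, the Text Processing operators followed by k-Means, for instance). Purely as an illustration of the idea, with toy data standing in for the scraped reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy review texts standing in for the scraped ones.
reviews = [
    "Shirt runs small, had to return for a larger size",
    "Fabric feels cheap and thin for the price",
    "Fast shipping, arrived two days early",
    "Sizing is way off, order one size up",
    "Delivery was quick and well packaged",
    "Great quality material, very soft",
]

# Vectorize and cluster into a handful of topics; the number of
# clusters is a guess you would tune on real data.
vectors = TfidfVectorizer(stop_words="english").fit_transform(reviews)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for topic, text in sorted(zip(labels, reviews)):
    print(topic, "-", text)
```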
Hope this helps
Wow @kayman, great sample process!
The only thing to add here is that you can enhance this process by using either "Get Pages" with a txt file of the specific links you want to retrieve, or the "Crawl Web" / "Process Documents from Web" operators with a set of automatic crawling rules. That should make it a bit easier to cycle through a lot of different categories/pages on the site.
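For intuition, crawling rules of that kind boil down to "follow links matching one pattern, store pages matching another". A minimal Python stand-in for that logic (both patterns and the start URL are assumptions, and the real site may well block plain HTTP requests):

```python
import re
from collections import deque

import requests
from lxml import html

# Follow links under /shop/, store only product pages (patterns assumed).
follow = re.compile(r"https://www\.macys\.com/shop/")
store = re.compile(r"/shop/product/")

queue, seen, kept = deque(["https://www.macys.com/shop"]), set(), []
while queue and len(kept) < 20:          # small cap for the demo
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    tree = html.fromstring(requests.get(url, timeout=30).text)
    if store.search(url):
        kept.append(url)                 # page matches the store rule
    for link in tree.xpath("//a/@href"):
        if follow.match(link):           # relative links are simply skipped
            queue.append(link)
```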
True, but it might not work very well with this specific website given the high level of dynamic code and redirections behind the scenes; the structure of the page makes my eyes bleed, tbh :-)
So the risk is pretty high that the original poster would get lost or end up with nothing when relying on crawling rules for this site. But of course, for any 'normal' site those are the first operators to look at.