Crawling Amazon for Review Text
Hello -
I am looking to better understand how to use the "Crawl Web" operator to pull review text from Amazon.
I have looked through a few posts but nothing seems to be getting at exactly what I am looking for. The goal would be to use Amazon's link structure to scan all of the reviews for a given product.
Below is the basic link structure for getting reviews for item "B019XFKM3". The only parts that need to be changed or looped over within the link are the numbers in ..."paging_btm_1?" and "reviews&pageNumber=1". Setting both numbers to 1, 2, 3, ... would let us scan each successive page of reviews.
How would I set this up using the "Crawl Web" operator, and how would I pull just the review text and star rating?
Any help would be greatly appreciated.
Best,
Dan
Answers
The "Crawl Web" operator has the option to add crawling rules in the parameters. You basically need to set up rules that correspond to your root URL and then use regular expressions to define the possible variations (like the final ...reviews&pageNumber=x portion of the URL). This is a very typical use case for the operator, and with a bit of trial and error you should be able to get it performing as you wish. You'll also want to look at the crawl depth parameter, which controls how many links deep the crawler will follow from the starting page.
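To make the regular-expression idea concrete, here is a minimal Python sketch for prototyping the pattern outside RapidMiner before pasting it into a crawling rule. The host and path in `URL_TEMPLATE` are hypothetical placeholders (the real link is not reproduced in this thread); only the "paging_btm_N?" and "reviews&pageNumber=N" fragments come from the question.

```python
import re

# Hypothetical stand-in URL: host and path are assumptions for illustration.
# Only "paging_btm_N?" and "reviews&pageNumber=N" come from the question.
URL_TEMPLATE = (
    "https://example.com/product-reviews/B019XFKM3/"
    "ref=paging_btm_{n}?reviews&pageNumber={n}"
)

# Candidate pattern for a URL-matching rule: any prefix, then the two
# page counters, each captured as one or more digits.
PAGE_RULE = re.compile(r".*paging_btm_(\d+)\?.*reviews&pageNumber=(\d+)")


def review_page_urls(last_page):
    """Build the URLs for review pages 1..last_page."""
    return [URL_TEMPLATE.format(n=n) for n in range(1, last_page + 1)]


def follows_rule(url):
    """True if the whole URL matches the rule and both counters agree."""
    match = PAGE_RULE.fullmatch(url)
    return bool(match) and match.group(1) == match.group(2)


if __name__ == "__main__":
    for url in review_page_urls(3):
        print(follows_rule(url), url)
```

Testing the expression against a handful of known-good and known-bad URLs like this tends to be quicker than rerunning the crawl for each adjustment.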
As far as saving only certain elements from the resulting page (like the text and rating), that can be quite a bit more complicated. You'll probably end up with some combination of Cut Document and Extract Content and then you'll need to Process Documents later to tokenize the review text, etc. The exact configuration of those operators is highly dependent on the data retrieved from the page, so you may also need to get creative with text searching or regular expressions to keep only the pieces that you want.
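As a rough illustration of the "keep only the pieces that you want" step, here is a small Python sketch that pulls a star rating and review text out of a page. The HTML snippet, the class names ("review", "rating", "review-text"), and the sample wording are all made up for the example; a real page will use different markup, so the selectors would need to be adapted after inspecting the retrieved documents.

```python
from html.parser import HTMLParser

# Made-up markup for illustration only; real pages will differ.
SAMPLE_HTML = """
<div class="review">
  <span class="rating">4.0 out of 5 stars</span>
  <span class="review-text">Works well, battery life is short.</span>
</div>
<div class="review">
  <span class="rating">5.0 out of 5 stars</span>
  <span class="review-text">Exactly as described.</span>
</div>
"""


class ReviewParser(HTMLParser):
    """Collects one dict per review block: {"rating": float, "text": str}."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._field = None  # which span we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and cls == "review":
            self.reviews.append({"rating": None, "text": ""})
        elif tag == "span" and cls in ("rating", "review-text"):
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if not self.reviews or self._field is None:
            return
        if self._field == "rating":
            # "4.0 out of 5 stars" -> 4.0
            self.reviews[-1]["rating"] = float(data.split()[0])
        else:
            self.reviews[-1]["text"] += data.strip()


def extract_reviews(html):
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


if __name__ == "__main__":
    for review in extract_reviews(SAMPLE_HTML):
        print(review["rating"], review["text"])
```

Inside RapidMiner the same two steps (isolate each review block, then extract the fields from it) would be handled by the operators mentioned above rather than by a script.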
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hello @dhunnewe, welcome to the community. I did some searching and could not find the link, but I am 99.9% sure that "scraping" amazon.com is against their Terms of Service. Using an operator like "Crawl Web" would therefore violate their policy, so I cannot really help you use this operator to do what you want.
That said, you can accomplish what you want in a better, and legal, way using their Product Advertising API. As others on this forum know, I am a huge REST API advocate and use them all the time, either with the "Enrich Data via Webservice" operator or by other methods. I would strongly suggest that you try going this route.
Scott