Webscraping JSON Content With RapidMiner

B00100719 · November 2018

I am a student and a RapidMiner novice, and I want to scrape a site that publishes customer reviews, but I cannot get this to work in RapidMiner. Here’s an example of the first webpage:

https://www.unum.com/employees/benefits/disability-insurance/long-term-disability-insurance?bvstate=pg:1/ct:r

RapidMiner picks up everything at the top and bottom of the page, but the actual review text and its associated attributes are stored in JSON, which the RapidMiner operators refuse to pick up. Whether I use the ‘Get Page(s)’ or the ‘Crawl Web’ operator, that part of the page is not scraped. Have you ever dealt with this before?

The page seems to require a token. The JSON file seems to be dynamically created.

How do I authenticate?

Where do I get a token?

Where do I put it?

How do I get the JSON content?

Please and thanks

A very simple example process follows:

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
[...]
</process>

MarcoBarradas · November 2018

I guess the JSON is pulled from an API rather than from a normal web page like the one posted. If you are supposed to get it from the web page, you should do it by setting the request properties.
Or maybe you could use the Enrich Data by Webservice operator.
You may follow @sgenzer's posts on how to use them:
https://community.rapidminer.com/discussion/35280/how-to-interact-with-google-cloud-apis-with-the-web-mining-extension
Or
https://community.rapidminer.com/discussion/comment/41800#Comment_41800
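
For readers unsure what "request properties" means here: they are the HTTP headers sent along with a request. Below is a minimal Python sketch, outside RapidMiner, of attaching such properties to a GET request. The endpoint and header values are hypothetical placeholders, not taken from this thread, and the request is only built, not sent.

```python
import urllib.request


def build_request(url, properties):
    """Build a GET request carrying the given header name/value pairs."""
    return urllib.request.Request(url, headers=properties, method="GET")


# Hypothetical endpoint and header values, for illustration only.
req = build_request(
    "https://api.example.com/reviews",
    {"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
)

# Inspect the request before sending it.
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))

# To actually send it: urllib.request.urlopen(req)
```

Which properties a given site expects, if any, has to be read off the real request in the browser's developer tools.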

B00100719 · November 2018

Thanks. I am a little further along, as I have a passkey now. But how do I get RapidMiner to use it? I have been inspecting the JSON elements through Chrome all along; the issue is getting RapidMiner to pull them! The 'Get Page' operator pulls the web page minus the JSON, which is the bit I want. The 'Crawl Web' operator lets you specify a username and password, but there is no parameter for passing a PassKey.

MarcoBarradas · November 2018

It seems that the JSON is at
https://api.bazaarvoice.com/data/display/0.2alpha/product/summary?PassKey=caMpDRdDUtaeikkWiWN5lpY1kmrXC9rPo1hDbuQ1Ne9d4&productid=2&contentType=reviews,questions&reviewDistribution=primaryRating,recommended&rev=0&contentlocale=en,en_US

You can inspect what is loaded while you access a web page by using the developer tools in Chrome.
Access it with your own web browser, since I guess the passkey will be dead by the time you see this post.
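
In the URL above the passkey is just an ordinary query parameter, not a username/password login, so a fresh key can be swapped into the same URL. Here is a rough Python sketch of that idea, outside RapidMiner. `OLD_KEY` and `NEW_KEY` are placeholders, the helper names are made up for illustration, and nothing is assumed about the shape of the JSON the API returns.

```python
import json
import urllib.request
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def split_api_url(url):
    """Split a captured request URL into its endpoint and a dict of query parameters."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return endpoint, params


def with_passkey(url, passkey):
    """Return the same URL with its PassKey parameter replaced."""
    endpoint, params = split_api_url(url)
    params["PassKey"] = passkey
    # safe="," keeps comma-separated values readable instead of %2C.
    return endpoint + "?" + urlencode(params, safe=",")


def fetch_json(url):
    """Download the URL and parse the response body as JSON (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# URL as copied from the browser's network tab, with the key replaced by a placeholder.
captured = (
    "https://api.bazaarvoice.com/data/display/0.2alpha/product/summary"
    "?PassKey=OLD_KEY&productid=2&contentType=reviews,questions"
    "&reviewDistribution=primaryRating,recommended&rev=0&contentlocale=en,en_US"
)

endpoint, params = split_api_url(captured)
fresh_url = with_passkey(captured, "NEW_KEY")
print(endpoint)
print(fresh_url)

# data = fetch_json(fresh_url)  # only works with a valid, current passkey
```

The rebuilt URL is then a plain GET request that returns JSON, which is presumably something the 'Get Page' operator could be pointed at directly instead of the HTML page.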
