Webscraping JSON Content With RapidMiner
I am a student and a RapidMiner novice, and I want to scrape a site that publishes customer reviews, but I cannot get this to work in RapidMiner. Here’s an example of the first webpage:
RapidMiner can pick up everything at the top and bottom of the pages, but the actual review text and its associated attributes are stored in JSON, which the RapidMiner processes just refuse to pick up. Whether I use the ‘Get Page(s)’ or the ‘Crawl Web’ operator, that part of the page is not scraped. Have you ever dealt with this before?
The page seems to require a token. The JSON file seems to be dynamically created.
How do I authenticate?
Where do I get a token?
Where do I put it?
How do I get the JSON content?
Please and thanks
A very simple example process follows:
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<parameter key="logfile" value="C:\Users\AHQ08\Desktop\Unum Reviews\MyLog.log"/>
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="179" y="289">
<parameter key="url" value="https://www.unum.com/employees/benefits/disability-insurance/long-term-disability-insurance?bvstate=pg:1/ct:r#"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="289">
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Best Answer
MarcoBarradas Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
I guess the JSON is pulled from an API, like the one I posted, rather than from a normal web page. If you are supposed to get it from the web page itself, you should do so by setting the request properties.
Or maybe you could use the Enrich Data by Webservice operator.
You may follow @sgenzer's posts on how to use them:
https://community.rapidminer.com/discussion/35280/how-to-interact-with-google-cloud-apis-with-the-web-mining-extension
Or
https://community.rapidminer.com/discussion/comment/41800#Comment_41800
Answers
https://api.bazaarvoice.com/data/display/0.2alpha/product/summary?PassKey=caMpDRdDUtaeikkWiWN5lpY1kmrXC9rPo1hDbuQ1Ne9d4&productid=2&contentType=reviews,questions&reviewDistribution=primaryRating,recommended&rev=0&contentlocale=en,en_US
You can inspect what is loaded when you access a web page by using the developer tools in Chrome.
Open the page in your own web browser to find the current request, since I guess the passkey will be dead by the time you see this post.
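As a rough sketch of the suggestion above, the process below points Get Page at the API URL instead of the HTML page and passes the result to the JSON To Data operator from the Text Processing extension. Several things here are assumptions rather than anything confirmed in this thread: `YOUR_PASSKEY` is a placeholder for whatever passkey the developer tools currently show, the `Accept` request property is only an illustration of where request properties go and may not be needed, and the `text:json_to_data` class name and port names are as I recall them, so please check them against your Studio version.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<!-- Assumption: request the API URL seen in the browser developer tools, not the HTML page.
     YOUR_PASSKEY is a placeholder. Inside an XML attribute every "&" must be escaped. -->
<operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="179" y="289">
<parameter key="url" value="https://api.bazaarvoice.com/data/display/0.2alpha/product/summary?PassKey=YOUR_PASSKEY&amp;productid=2&amp;contentType=reviews,questions&amp;reviewDistribution=primaryRating,recommended&amp;rev=0&amp;contentlocale=en,en_US"/>
<list key="query_parameters"/>
<!-- Assumption: an illustrative request property; the API may answer without it. -->
<list key="request_properties">
<parameter key="Accept" value="application/json"/>
</list>
</operator>
<!-- Assumption: class and port names as recalled from the Text Processing extension. -->
<operator activated="true" class="text:json_to_data" compatibility="8.1.000" expanded="true" height="82" name="JSON To Data" width="90" x="380" y="289"/>
<connect from_op="Get Page" from_port="output" to_op="JSON To Data" to_port="documents 1"/>
<connect from_op="JSON To Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
```

If this imports cleanly, the result should be an example set with one column per JSON field that the endpoint returns. Whether that includes the review text depends on the endpoint, so it is worth comparing against the response shown in the developer tools.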