Read full-article RSS feeds with RapidMiner and a free API
SGolbert
RapidMiner Certified Analyst, Member Posts: 344 Unicorn
Hi RapidMiners!
I wanted to share a process that I use to get full articles out of RSS feeds. It uses Python's Beautiful Soup together with Postlight's Mercury Parser web API.
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="Read RSS Feed" width="90" x="112" y="34">
<parameter key="url" value="https://www.presseportal.de/rss/polizei/laender/9.rss2"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="514" y="34">
<parameter key="script" value="import pandas # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(data): 	import requests 	from bs4 import BeautifulSoup 	import json 	 	headers = {"Content-Type": "application/json", 	 "x-api-key": "GET YOUR OWN!" 	 } 	 	results = [] 	for address in data.Link: 		url = 'https://mercury.postlight.com/parser?url=' + address 		 		for dummy in range(10): 			try: 				response = requests.get(url, headers = headers) 				break 			except: 				continue 		 		html = json.loads(response.content) 		html = html['content'] 		 		soup = BeautifulSoup(html, "lxml") 		text = soup.get_text() 		text = text.replace('\n', ' ') 		results.append(text) 	 	data['main_text'] = results 	return data"/>
</operator>
<connect from_op="Read RSS Feed" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
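For readability, here is the same script as plain Python, with two small defensive tweaks of mine (not part of the process above): the bare except is narrowed to requests exceptions, and an article whose retries all fail gets an empty string instead of crashing on an undefined response. The x-api-key value is of course a placeholder; get your own Mercury Parser key.

import json
import requests
import pandas
from bs4 import BeautifulSoup

# rm_main is a mandatory function; the number of arguments
# has to match the number of input ports (can be none).
def rm_main(data):
    headers = {
        "Content-Type": "application/json",
        "x-api-key": "GET YOUR OWN!",  # replace with your own Mercury Parser key
    }

    results = []
    for address in data.Link:
        url = 'https://mercury.postlight.com/parser?url=' + address

        # Individual requests can fail, so retry up to 10 times.
        response = None
        for _ in range(10):
            try:
                response = requests.get(url, headers=headers)
                break
            except requests.RequestException:
                continue

        if response is None:
            # All retries failed; keep the row but leave the text empty.
            results.append('')
            continue

        # The API returns JSON; the article body is HTML under 'content'.
        html = json.loads(response.content)['content']

        # Strip the markup and flatten the text to a single line.
        soup = BeautifulSoup(html, "lxml")
        results.append(soup.get_text().replace('\n', ' '))

    data['main_text'] = results
    return data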
Considering that there are commercial products that do the same, I think it is a valuable resource! Note, however, that the number of API calls is limited, so take that into account. It is also much slower than plain web-scraping alternatives. I hope you enjoy it!
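If you do run out of API calls, a minimal sketch of the web-scraping alternative looks like this (the 'article' selector is just a guess; every site needs its own extraction logic, which is exactly what Mercury otherwise does for you):

import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # Fetch the page directly instead of going through the Mercury API.
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "lxml")
    # <article> is a common container tag, but every site differs;
    # adapt this selector to the feed you are scraping.
    body = soup.find('article') or soup.body
    return body.get_text().replace('\n', ' ')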
Comments
This is GREAT, @SGolbert! Can I put this on the community repo (with full credit to you, of course)?
Yes, sure!
DONE! You can find the process here.
Scott