Read full-article RSS feeds with RapidMiner and a free API
SGolbert
RapidMiner Certified Analyst, Member Posts: 344 Unicorn
Hi RapidMiners!
I wanted to share a process that I use to get full articles out of RSS feeds. It uses Python's Beautiful Soup together with Postlight's Mercury Parser web API.
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="Read RSS Feed" width="90" x="112" y="34">
<parameter key="url" value="https://www.presseportal.de/rss/polizei/laender/9.rss2"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="514" y="34">
<parameter key="script" value="import pandas # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(data): 	import requests 	from bs4 import BeautifulSoup 	import json 	 	headers = {"Content-Type": "application/json", 	 "x-api-key": "GET YOUR OWN!" 	 } 	 	results = [] 	for address in data.Link: 		url = 'https://mercury.postlight.com/parser?url=' + address 		 		for dummy in range(10): 			try: 				response = requests.get(url, headers = headers) 				break 			except: 				continue 		 		html = json.loads(response.content) 		html = html['content'] 		 		soup = BeautifulSoup(html, "lxml") 		text = soup.get_text() 		text = text.replace('\n', ' ') 		results.append(text) 	 	data['main_text'] = results 	return data"/>
</operator>
<connect from_op="Read RSS Feed" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
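For readability, here is the same script as plain Python, with two small defensive tweaks of mine (not part of the process above): the bare except is narrowed to requests exceptions, and an article whose retries all fail gets an empty string instead of crashing on an undefined response. The x-api-key value is of course a placeholder; get your own Mercury Parser key.

import json
import requests
import pandas
from bs4 import BeautifulSoup

# rm_main is a mandatory function; the number of arguments
# has to match the number of input ports (can be none).
def rm_main(data):
    headers = {
        "Content-Type": "application/json",
        "x-api-key": "GET YOUR OWN!",  # replace with your own Mercury Parser key
    }

    results = []
    for address in data.Link:
        url = 'https://mercury.postlight.com/parser?url=' + address

        # Individual requests can fail, so retry up to 10 times.
        response = None
        for _ in range(10):
            try:
                response = requests.get(url, headers=headers)
                break
            except requests.RequestException:
                continue

        if response is None:
            # All retries failed; keep the row but leave the text empty.
            results.append('')
            continue

        # The API returns JSON; the article body is HTML under 'content'.
        html = json.loads(response.content)['content']

        # Strip the markup and flatten the text to a single line.
        soup = BeautifulSoup(html, "lxml")
        results.append(soup.get_text().replace('\n', ' '))

    data['main_text'] = results
    return data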
Considering that there are commercial products that do the same, I think it is a valuable resource! Note, however, that the number of API calls is limited, so take that into account. It is also much slower than plain web-scraping alternatives. I hope you enjoy it!
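If you do run out of API calls, a minimal sketch of the web-scraping alternative looks like this (the 'article' selector is just a guess; every site needs its own extraction logic, which is exactly what Mercury otherwise does for you):

import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # Fetch the page directly instead of going through the Mercury API.
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "lxml")
    # <article> is a common container tag, but every site differs;
    # adapt this selector to the feed you are scraping.
    body = soup.find('article') or soup.body
    return body.get_text().replace('\n', ' ')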
Comments
This is GREAT, @SGolbert! Can I put this on the community repo (with full credit to you, of course)?
Yes, sure!
DONE! You can find the process here.
Scott