[ALMOST SOLVED] Web Crawling and Text Editing challenge
Kind people of the rapid-i,
I'm a complete beginner in the RapidMiner world and I am dealing with a project that seems harder than expected. Maybe it's just that I am still learning all the tools and operators of RM, but here's the situation:
I've got a website with news and articles (i.e. www.parolibero.it).
I would like to do three things:
1. Extract the articles from the website (in text format or, even better, in XML format, keeping tags such as title, subtitle, body...)
2. Create an Excel list of the articles with the title and URL of each article
3. Export the data in a graphical format that highlights some chosen differences: for example, I would like to get a diagram showing how many articles were written in a specific year or by a specific journalist (how is it possible to use search filters once I have downloaded the data files?)
I've tried to use web crawling, but all I get is the home page in txt format and then an Excel file with just one record.
Can you please help me? At the very least I would like to know where I am going wrong or which operators to use for this.
Thank you very much indeed for your help!
Leon
P.S. There is no copyright issue at all as I am one of the staff of that website
Answers
To extract information from the site you can, for example, use the Get Page operator followed by Cut Documents and Extract Information, see here: One thing you have to note is that in XPath every HTML element name must be prefixed with 'h:' (for example //h:div rather than //div). Otherwise it won't work.
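To illustrate why the 'h:' prefix matters, here is a minimal sketch in Python rather than a RapidMiner process. It assumes the page is well-formed XHTML in the standard XHTML namespace; the sample markup, the `article` class and the function names are made up for the example and are not taken from the actual site. The same XPath idea (prefix every element name) carries over to the queries you enter in Cut Documents and Extract Information.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet standing in for a fetched news page.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="article">
      <h2><a href="/news/first">First article</a></h2>
    </div>
    <div class="article">
      <h2><a href="/news/second">Second article</a></h2>
    </div>
  </body>
</html>"""

# Bind the 'h' prefix to the XHTML namespace so queries can say h:div, h:a, ...
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}


def extract_articles(xhtml):
    """Return (title, url) pairs, using namespace-prefixed element names."""
    root = ET.fromstring(xhtml)
    rows = []
    for div in root.findall(".//h:div[@class='article']", XHTML_NS):
        link = div.find("h:h2/h:a", XHTML_NS)
        rows.append((link.text, link.get("href")))
    return rows


def extract_without_prefix(xhtml):
    """Same query without the prefix: it looks for 'div' in no namespace."""
    root = ET.fromstring(xhtml)
    return root.findall(".//div[@class='article']")


if __name__ == "__main__":
    print(extract_articles(PAGE))        # one (title, url) pair per article
    print(extract_without_prefix(PAGE))  # empty list: nothing matches
```

The unprefixed query finds nothing because every element in the parsed page belongs to the XHTML namespace, which is the same reason an unprefixed XPath fails in RapidMiner. The (title, url) pairs are also the rows you would want in the Excel list.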
Best,
Nils