"Extracting text from a website using an
geschwader
Member Posts: 16 Contributor II
Hello. I'm new to RapidMiner and I want to analyse the text of certain parts of web pages. From news pages I want to extract the title, main text and date. The text and title must be cleaned of HTML and all other tags; the date must be kept in a date data format. Is this possible? I tried the "get pages" and "extract information" operators, but the latter keeps the whole text and stores the parts I need as attributes, so I can't apply an HTML processing operator to those attributes.
So, I'm stuck with this (just random example with BBC news site):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
<process expanded="true" height="460" width="681">
<operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="63" y="145">
<parameter key="url" value="http://www.bbc.co.uk/news/uk-12778022"/>
<parameter key="random_user_agent" value="true"/>
<list key="query_parameters"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="209" y="147">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="Title" value="&lt;h1 class=&quot;story-header&quot;&gt;(.*?)&lt;/h1&gt;"/>
<parameter key="Story" value="&lt;p class=&quot;introduction&quot; id=&quot;story_continues_1&quot;&gt;(.*?)&lt;/div&gt;&lt;!-- / story-body --&gt;"/>
<parameter key="Date" value="&lt;span class=&quot;date&quot;&gt;(.*?)&lt;/span&gt;"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Extract Information" to_port="document"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
Answers
best regards,
andré
1. I have to analyse news sites like http://www.bbc.co.uk/. From news pages like http://www.bbc.co.uk/news/uk-12778022 I want to extract the story title, the story main text and the story date.
2. To do this I use the Crawl Web and Extract Information operators. I use a "Regular Expression" query and it extracts the information I need, so I don't need XPath. On the page http://www.bbc.co.uk/news/uk-12778022 the date is extracted with the query <meta name="OriginalPublicationDate" content="(.*?)"/> (the original string is <meta name="OriginalPublicationDate" content="2011/03/17 20:06:08"/>), the title is extracted with the query <h1 class="story-header">(.*?)</h1> (the original string is <h1 class="story-header">Japan crisis: UK rescue team to withdraw</h1>), and the main text of the story is extracted with the query <p class="introduction" id="story_continues_1">(.*?)</div><!-- / story-body -->.
3. Now, the problem is with the last one: it extracts text full of tag garbage. It looks like this: http://usic.org.ua/upload/151029c6dbdb05409ca8506de206cb60485a0c27/Code.txt. I want to clean the main text, but the HTML processing operator doesn't work on data attributes. I tried Data to Documents, but it didn't work: it created a Document Collection IOObject, which again is not accepted by the HTML processing operator.
So, the question is: how can I transform the extracted data into documents that can be processed like "normal" text documents, such as TXT files from the hard drive, and then combine them into a data set again?
Any ideas appreciated.
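For readers who want to check the three queries outside RapidMiner, here is a minimal Python sketch of the same extraction. It is not the RapidMiner operator itself: `SAMPLE_HTML` is a made-up, shortened stand-in for the page source, and `extract_fields` is a hypothetical helper name. The patterns and the date string are the ones quoted in the post.

```python
import re
from datetime import datetime

# Hypothetical, shortened stand-in for the page source (not the real BBC page).
SAMPLE_HTML = """<html><head>
<meta name="OriginalPublicationDate" content="2011/03/17 20:06:08"/>
</head><body>
<h1 class="story-header">Japan crisis: UK rescue team to withdraw</h1>
<div class="story-body">
<p class="introduction" id="story_continues_1">First paragraph with <b>markup</b>.</p>
<p>Second paragraph.</p>
</div><!-- / story-body -->
</body></html>"""

# The three regular expressions quoted in the post.
PATTERNS = {
    "Title": r'<h1 class="story-header">(.*?)</h1>',
    "Story": r'<p class="introduction" id="story_continues_1">(.*?)</div><!-- / story-body -->',
    "Date": r'<meta name="OriginalPublicationDate" content="(.*?)"/>',
}


def extract_fields(html_text):
    """Return the first capture group of each pattern, or None when it does not match."""
    fields = {}
    for name, pattern in PATTERNS.items():
        # DOTALL lets (.*?) span line breaks, which the story text needs.
        match = re.search(pattern, html_text, flags=re.DOTALL)
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(SAMPLE_HTML)

# Turn the date string into a real datetime value instead of keeping it as text.
published = datetime.strptime(fields["Date"], "%Y/%m/%d %H:%M:%S")
```

The title and date come out clean, while `fields["Story"]` still contains inline tags such as `<b>` and `</p>`, which is the problem described in point 3.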
Which operator do you mean by "HTML processing operator"? After using "Extract Information" you should still have a document, not example set data. I have done a lot of similar web mining tasks and always converted the document to data (via "Documents to Data" or "Process Documents") after the necessary steps, since the document type has some limitations for processing. Then I use the "Replace" operator to filter out HTML tags. A very simple solution is a regular expression like "<[^>]*>", whose matches are removed (replaced by nothing). I use a filter chain of Replace operators to cover some special cases and call the whole filter process as an operator ("Execute Process") every time I need HTML filtering.
Regards
Matthias
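The filter-chain idea can be sketched in Python as an ordered list of replacements. This is only an illustration under assumptions, not the actual chain used above: the choice of special cases (comments, script/style blocks, entities, whitespace) and the name `strip_html` are mine. The tag pattern uses a negated character class, `[^>]`, with the caret inside the brackets.

```python
import html
import re

# Ordered filter chain: each (pattern, replacement) pair is applied in turn.
FILTERS = [
    # HTML comments, including multi-line ones.
    (re.compile(r"<!--.*?-->", re.DOTALL), " "),
    # script/style blocks together with their content.
    (re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE), " "),
    # Any remaining tag: "<", then anything except ">", then ">".
    (re.compile(r"<[^>]*>"), " "),
    # Collapse runs of whitespace left behind by the replacements.
    (re.compile(r"\s+"), " "),
]


def strip_html(text):
    """Apply every filter in order, then decode entities such as &amp;."""
    for pattern, replacement in FILTERS:
        text = pattern.sub(replacement, text)
    return html.unescape(text).strip()


cleaned = strip_html(
    '<p class="introduction">Hello <b>world</b> &amp; all</p>'
    '<script>var x = "<p>";</script>'
)
```

Tags are replaced by a space rather than by nothing so that words in adjacent paragraphs do not run together; the whitespace filter then tidies up the result.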
However, <^[>]*> didn't work: it left, for instance, all the <p></p> tags in place. Some pages have lots of tags, so I'm still wondering whether I could use the predefined filtering of the "Extract Content" operator in this case.
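A quick way to see why the tags survive is to compare the two caret positions. The sketch below uses Python's `re` module on a made-up sample string; I would expect Java-style regular expressions to treat this pattern the same way, but that is an assumption.

```python
import re

sample = "<p>First paragraph.</p><p>Second paragraph.</p>"

# Caret outside the brackets: "^" is a start-of-input anchor, which can never
# match right after a literal "<", so nothing is replaced.
misplaced = re.sub(r"<^[>]*>", "", sample)

# Caret inside the brackets: "[^>]" means "any character except >",
# so every tag from "<" to the next ">" is removed.
negated = re.sub(r"<[^>]*>", "", sample)
```

Here `misplaced` is identical to `sample`, while `negated` contains only the paragraph text.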
If you insist on using the "Extract Content" operator you could convert the example set back to documents as you described before. If you use the expert parameter "select attributes and weights" for the "Data to Documents" operator you can simply choose only the attribute containing your story text to avoid a whole collection of documents for all your attributes. This single document should work with the desired operator. Without having a look inside I would guess "Extract Content" also uses regular expressions. In my opinion you don't really need the conversion back to a document to achieve this.
Regards
Matthias
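The "clean only the story attribute and keep the rest" idea can be pictured with plain Python records. This is a rough sketch, not RapidMiner's data model: the rows are dictionaries, `clean_attribute` is a hypothetical helper, and the story text is a placeholder.

```python
import re

TAG = re.compile(r"<[^>]*>")


def clean_attribute(rows, attribute):
    """Return copies of the rows with tags stripped from one attribute only."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the other attributes stay untouched
        # Replace tags by spaces, then normalise whitespace.
        new_row[attribute] = " ".join(TAG.sub(" ", row[attribute]).split())
        cleaned.append(new_row)
    return cleaned


rows = [{
    "Title": "Japan crisis: UK rescue team to withdraw",
    "Date": "2011/03/17 20:06:08",
    "Story": "<p>First paragraph.</p>\n<p>Second paragraph.</p>",
}]
result = clean_attribute(rows, "Story")
```

Because only the selected attribute is rewritten, the title and date stay attached to the same row and no separate step is needed to combine them again.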