"Extracting text from a website using an

geschwader · March 2011

Hello. I'm a newbie with RapidMiner and I want to analyse the text of certain parts of web pages. From news pages I want to extract the title, main text and date. The text and title must be cleaned of HTML and all other tags, and the date must be kept in a date data format. Is it possible? I tried the "Get Pages" and "Extract Information" operators, but the latter keeps the whole text and stores the parts I need as attributes, so I can't apply an HTML processing operator to those attributes.
So, I'm stuck with this (just a random example with the BBC news site):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
    <process expanded="true" height="460" width="681">
      <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="63" y="145">
        <parameter key="url" value="http://www.bbc.co.uk/news/uk-12778022"/>
        <parameter key="random_user_agent" value="true"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="209" y="147">
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries">
          <parameter key="Title" value="&lt;h1 class=&quot;story-header&quot;&gt;(.*?)&lt;/h1&gt;"/>
          <parameter key="Story" value="&lt;p class=&quot;introduction&quot; id=&quot;story_continues_1&quot;&gt;(.*?)&lt;/div&gt;&lt;!-- / story-body --&gt;"/>
          <parameter key="Date" value="&lt;span class=&quot;date&quot;&gt;(.*?)&lt;/span&gt;"/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Extract Information" to_port="document"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

andk · March 2011

Look at this topic, it might help you: http://rapid-i.com/rapidforum/index.php/topic,3444.0.html. To give you advice on how to parse out the information you want, it would help if you gave us the XML/HTML code of one example document. If the documents you are interested in are XML, you may want to choose an XPath query; otherwise I would use string matching. If you use the search function of this forum, there is a topic where a detailed process is posted that reads song and artist information from a playfm playlist. You could take this process and modify it for your needs.

best regards,

andré

geschwader · March 2011

andk wrote:

Look at this topic, it might help you: http://rapid-i.com/rapidforum/index.php/topic,3444.0.html.

OK, that helped in a way. At least I've started to work on date extraction. Thanks. However, even here I still have problems: I have the string "2011/03/17 20:06:08", use the parsing format "yyyy/mm/dd hh:mm:ss" and get the result "17 January 2011 20:06:08 EET". WTF? Why January?

To give you advice on how to parse out the information you want, it would help if you gave us the XML/HTML code of one example document.

Well, the URL of an example document is given in the code I attached to my previous post. If you go to this web page and press Ctrl+U, you'll see the source of the page. But OK, maybe I described my problem in a messy way. Sorry for that. Now I'll try to explain more thoroughly.
1. I have to analyse news sites like http://www.bbc.co.uk/. From news pages like http://www.bbc.co.uk/news/uk-12778022 I want to extract the story title, the story main text and the story date.
2. To do this I use the Crawl Web and Extract Information operators. I use the "Regular Expression" query type and it extracts the information I need, so I don't need XPath. On the page http://www.bbc.co.uk/news/uk-12778022 the date is extracted with the query <meta name="OriginalPublicationDate" content="(.*?)"/> (the original string is <meta name="OriginalPublicationDate" content="2011/03/17 20:06:08"/>), the title is extracted with the query <h1 class="story-header">(.*?)</h1> (the original string is <h1 class="story-header">Japan crisis: UK rescue team to withdraw</h1>), and the main text of the story is extracted with the query <p class="introduction" id="story_continues_1">(.*?)</div>.
3. Now, the problem is with the latter: it extracts text full of tag garbage. It looks like this: http://usic.org.ua/upload/151029c6dbdb05409ca8506de206cb60485a0c27/Code.txt. I want to clean the main text, but the HTML processing operator doesn't work with data attributes. I tried Data to Documents, but it didn't work: it created a document collection IOObject, which again is not accepted by HTML processing.
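
For readers who want to try the three queries outside RapidMiner, here is a minimal Python sketch. The date pattern and the two quoted source strings come from the post, and the title pattern is the one used in the process posted earlier. The story markup inside `html` and the helper name `extract` are invented for illustration. It is not how the Extract Information operator is implemented; it only shows why the story field still comes back full of tags.

```python
import re

# Hypothetical page fragment: the <meta> and <h1> lines are the strings quoted
# in the post; the story paragraphs are made up for this example.
html = '''<meta name="OriginalPublicationDate" content="2011/03/17 20:06:08"/>
<h1 class="story-header">Japan crisis: UK rescue team to withdraw</h1>
<p class="introduction" id="story_continues_1">First <b>paragraph</b>.</p>
<p>Second paragraph.</p>
</div>'''

# One regular expression per field, each with a single capture group.
queries = {
    "Date": r'<meta name="OriginalPublicationDate" content="(.*?)"/>',
    "Title": r'<h1 class="story-header">(.*?)</h1>',
    "Story": r'<p class="introduction" id="story_continues_1">(.*?)</div>',
}


def extract(document, patterns):
    """Return the first capture group of each pattern, or None if it does not match."""
    result = {}
    for name, pattern in patterns.items():
        # DOTALL lets "." cross line breaks, which the multi-line story needs.
        match = re.search(pattern, document, re.DOTALL)
        result[name] = match.group(1) if match else None
    return result


fields = extract(html, queries)
print(fields["Date"])
print(fields["Title"])
print(fields["Story"])  # still contains <b>, </p>, <p> ...
```

The date and title come out clean because their capture groups contain no markup, while the story group spans several elements and keeps every tag between its start and end markers.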

So, the question is: how do I transform the extracted data into documents that can be processed as "normal" text documents, like TXT files from a hard drive, and then combine them into a data set again?

Any ideas appreciated.

colo · March 2011

Hi geschwader,

geschwader wrote:

OK, that helped in a way. At least I've started to work on date extraction. Thanks. However, even here I still have problems: I have the string "2011/03/17 20:06:08", use the parsing format "yyyy/mm/dd hh:mm:ss" and get the result "17 January 2011 20:06:08 EET". WTF? Why January?

It seems a bit suspicious that you are trying to extract month and minute with the same pattern letter "m". Try "M" for month instead.
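
Some background, assuming RapidMiner's date patterns follow Java's SimpleDateFormat letters: there "M" is the month and "m" is the minute, so "yyyy/mm/dd" reads 03 as minutes and the month falls back to January. Under that assumption the pattern would be "yyyy/MM/dd HH:mm:ss", where "HH" is the 24-hour field and "hh" the 12-hour one. Python's `strptime` makes the same month/minute split with `%m` and `%M`, so the sketch below is only an analogy in another language, not the RapidMiner operator.

```python
from datetime import datetime

raw = "2011/03/17 20:06:08"

# %m = month, %M = minute, %H = 24-hour clock; mixing up the first two is
# the same kind of mistake as writing "mm" for the month in a Java-style pattern.
parsed = datetime.strptime(raw, "%Y/%m/%d %H:%M:%S")

print(parsed.month, parsed.minute)
print(parsed.isoformat())
```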

Which operator do you mean by "HTML processing operator"? After using "Extract Information" you should still have a document, not example set data. I have done a lot of similar web mining tasks and always converted the document to data (via "Documents to Data" or "Process Documents") after the necessary steps, since the document type has some limitations for processing. Then I use the "Replace" operator to filter out HTML tags. A very simple solution is a regular expression like "<^[>]*>", whose matches are removed (replaced by nothing). I use a filter chain of Replace operators to cover some special cases and call the whole filter process as an operator (Execute Process) every time I need HTML filtering.

Regards
Matthias

geschwader · March 2011

colo wrote:

Hi geschwader,

It seems a bit suspicious that you are trying to extract month and minute by the same pattern "m". Try "M" for month instead

Hello. Yeah, you're right.

That worked. Thanks.

Which operator do you mean by "HTML processing operator"?

I mean "Extract Content" (Web Mining → HTML Processing → Extract Content).

After using "Extract Information" you should still have a document, not example set data.

Yes, but, you see, it leaves the whole document and extracted information as metadata. I can't use "Extract Content" on metadata.

Then I use the "Replace" operator to filter out HTML tags. A very simple solution can be achieved by a regular expression like "<^[>]*>" which shall be removed (replaced by nothing). I use a filter chain of replace operators to cover some special cases and use the whole filter process as operator (Execute process) every time I need HTML filtering.

OK, that's one solution. It did what I wanted in a couple of replacements (I tried it on a simple web page). Thanks a lot!
However, <^[>]*> didn't work: it left, for instance, all the <p></p> tags. Some pages have lots of tags, so I'm still wondering whether I could use the predefined filtering of the "Extract Content" operator in this case.

colo · March 2011

Hello again,

geschwader wrote:

However, <^[>]*> didn't work: it left, for instance, all the <p></p> tags. Some pages have lots of tags, so I'm still wondering whether I could use the predefined filtering of the "Extract Content" operator in this case.

I'm sorry, the circumflex should of course be inside the brackets to make sense for tag filtering: <[^>]*>. This should filter out any tag!
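
The difference between the two patterns can be checked with a small Python sketch. The sample string and the names `BROKEN`, `FIXED` and `strip_tags` are made up for illustration, and Python's `re.sub` stands in for the "Replace" operator. In `<^[>]*>` the circumflex is a start-of-input anchor placed after a literal `<`, so the pattern can never match; in `<[^>]*>` it negates the character class and means "any character except `>`".

```python
import re

BROKEN = r"<^[>]*>"  # "^" outside the brackets: an anchor that cannot match here
FIXED = r"<[^>]*>"   # "^" inside the brackets: any character except ">"


def strip_tags(text, pattern=FIXED):
    """Replace every match of the pattern with nothing."""
    return re.sub(pattern, "", text)


sample = '<p class="introduction">UK rescue team <b>to withdraw</b></p>'

print(strip_tags(sample, BROKEN))  # tags are left in place
print(strip_tags(sample))          # tags are removed
```

A single pattern like this is a rough filter: it does not decode entities such as `&amp;` and does not drop the contents of script or style elements, which is presumably why a chain of replacements for special cases is used in practice.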

If you insist on using the "Extract Content" operator you could convert the example set back to documents as you described before. If you use the expert parameter "select attributes and weights" for the "Data to Documents" operator you can simply choose only the attribute containing your story text to avoid a whole collection of documents for all your attributes. This single document should work with the desired operator. Without having a look inside I would guess "Extract Content" also uses regular expressions. In my opinion you don't really need the conversion back to a document to achieve this.

Regards
Matthias

geschwader · March 2011

colo wrote:

In my opinion you don't really need the conversion back to a document to achieve this.

OK, let's leave it. But the "Extract Information" operator is also meant for a single document. When you use "Get Pages", how do you extract the same information from every document if there are, say, 200 of them? "Get Pages" also has a data set output, which "Extract Information" does not accept.

colo · March 2011

"Generate Extract" does the same for example sets and allows multiple examples (obtained from multiple documents). I just convert from documents to data after retrieving the pages (sometimes using "Cut Document" before to achieve multiple matches via XPath or RegEx).
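
The idea of applying the same queries to every retrieved page and getting one example (row) per document can be sketched in Python as follows. This is only an illustration of the data flow, not the implementation of "Generate Extract"; the URLs, page contents and the names `QUERIES` and `pages_to_rows` are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")

# The same named queries are applied to every document.
QUERIES = {
    "Title": re.compile(r'<h1 class="story-header">(.*?)</h1>', re.DOTALL),
    "Date": re.compile(r'<meta name="OriginalPublicationDate" content="(.*?)"/>'),
}


def pages_to_rows(pages):
    """Build one row (dict) per page; a field is None when its query does not match."""
    rows = []
    for url, html in pages.items():
        row = {"url": url}
        for name, pattern in QUERIES.items():
            match = pattern.search(html)
            # Strip any tags left inside the captured text.
            row[name] = TAG.sub("", match.group(1)).strip() if match else None
        rows.append(row)
    return rows


# Hypothetical input: two fetched pages keyed by URL.
pages = {
    "http://example.com/a": '<meta name="OriginalPublicationDate" content="2011/03/17 20:06:08"/>'
                            '<h1 class="story-header">First <em>story</em></h1>',
    "http://example.com/b": '<h1 class="story-header">Second story</h1>',
}

for row in pages_to_rows(pages):
    print(row)
```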

geschwader · March 2011

colo wrote:

"Generate Extract" does the same for example sets and allows multiple examples (obtained from multiple documents). I just convert from documents to data after retrieving the pages (sometimes using "Cut Document" before to achieve multiple matches via XPath or RegEx).

Now everything is clear and working properly. Thank you!
