Extract Information Problem

dezok · August 2011

Hi, I'm quite new to this stuff, but I'm stuck on how to extract part of the HTML code from a saved web page with XPath. The code on the page looks like the snippet below, and I would like to extract the number.


<div>
<span>Age:</span>
 22
</div>

I tried the XPath below in Google Docs and it works exactly how I want it to.

//*[contains(., 'Age:')]/../span/text()

But I just can't figure out what I am missing in RapidMiner. I tried adding the HTML namespace prefix, as well as adding predicates. The RapidMiner XPath looked like:

//h:*[contains(., 'Age:')]/../span/text()[2]

And it gives empty fields as results. Does anyone have any suggestions?

colo · August 2011

Hi dezok,

it's hard to believe that the query you posted extracted the number from your example. Since it requests the text content of the span element, it should return "Age:".
But if you leave out the span element and address the second text node (the first one seems to be only whitespace), then it should work:

//h:*[contains(., 'Age:')]/../text()[2]

Regards
Matthias
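
To illustrate why the index [2] is needed, here is a minimal Python sketch (standard library only, not RapidMiner itself) that lists the text nodes that are direct children of the div from the first post. It assumes the snippet is parsed as plain XML without a namespace; RapidMiner's own HTML handling may treat whitespace differently.

```python
from xml.dom import minidom

# The snippet from the first post, parsed as plain XML (assumption: no namespace).
snippet = "<div>\n<span>Age:</span>\n 22\n</div>"
div = minidom.parseString(snippet).documentElement

# Collect the text nodes that are direct children of the div,
# i.e. roughly what div/text() selects in XPath.
text_nodes = [n.data for n in div.childNodes if n.nodeType == n.TEXT_NODE]

# Print each text node with its 1-based position, as XPath would count it.
for position, value in enumerate(text_nodes, start=1):
    print(position, repr(value))
```

The first text node is only the line break between the opening div tag and the span; the number sits in the second one, which is what text()[2] addresses.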

dezok · August 2011

Hi colo, my mistake, of course I used

//h:*[contains(., 'Age:')]/../text()

without the span element (in desperation I tried span and forgot to remove it). I used the query addressing the second text element and still got no results.
I did use a different query (posted below along with a bigger HTML sample) to get the age, and it worked when addressing the second text element. But it creates another problem: I have more than the age element to scrape, the elements are not always in the same order (sometimes one is missing), and I get weird results like "Age: female".
HTML code:


<div class="userAttributes">
<div>
<span>Age:</span>
22
</div>
<div>
<span>Sex:</span>
female
</div>
</div>

XPath that scraped Age in RapidMiner:


//h:div[@class='userAttributes']/h:div[1]/text()[2]

XPath that scraped Sex:


//h:div[@class='userAttributes']/h:div[2]/text()[2]

But as I said, when there is no age in the HTML I get results like "Age: female".

colo · August 2011

Hi dezok,

using fixed position predicates of course causes trouble if contents are missing or appear in a different order in different documents. You have to match on the text contents to avoid this. I tried to use the following::text() axis to get the text content after addressing one of the span nodes, but this always returned the text of the next node. I am not sure whether the axis works for text(), but I don't have the time to investigate further. So I went one step back and selected the text as in the initial example.

This example process works fine for me:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="145" width="279">
      <operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
        <parameter key="text" value="&lt;div class=&quot;userAttributes&quot;&gt;&#10;&lt;div&gt;&#10;&lt;span&gt;Age:&lt;/span&gt;&#10;22&#10;&lt;/div&gt;&#10;&lt;div&gt;&#10;&lt;span&gt;Sex:&lt;/span&gt;&#10;female&#10;&lt;/div&gt;&#10;&lt;/div&gt;"/>
      </operator>
      <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="age" value="//h:div[@class=&quot;userAttributes&quot;]//h:span[contains(./text(), &quot;Age&quot;)]/../text()[2]"/>
          <parameter key="sex" value="//h:div[@class=&quot;userAttributes&quot;]//h:span[contains(./text(), &quot;Sex&quot;)]/../text()[2]"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
      <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
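
For comparison outside RapidMiner, the same label-based idea can be sketched in Python with the standard library. This is only an illustration under assumptions: the helper name extract_by_label is hypothetical, the input is treated as well-formed XML without a namespace, and ElementTree exposes the text after a closing span tag as the span's tail rather than through text()[2].

```python
import xml.etree.ElementTree as ET


def extract_by_label(html, label):
    """Return the text following the span whose text contains `label`, or None.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    root = ET.fromstring(html)
    for span in root.iter("span"):
        if label in (span.text or ""):
            # In ElementTree the text after </span> is stored as the span's tail.
            return (span.tail or "").strip()
    return None


# The sample from the thread, and a variant where the Age entry is missing.
full = ('<div class="userAttributes"><div><span>Age:</span>\n22\n</div>'
        '<div><span>Sex:</span>\nfemale\n</div></div>')
no_age = ('<div class="userAttributes">'
          '<div><span>Sex:</span>\nfemale\n</div></div>')

print(extract_by_label(full, "Age"))
print(extract_by_label(no_age, "Age"))
print(extract_by_label(no_age, "Sex"))
```

Because the lookup is keyed on the label text instead of the position of the inner div, a missing Age entry yields no value rather than the value of the next attribute.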

Regards
Matthias

dezok · August 2011

Hi again colo,
that works great

you're a lifesaver. Thanks for the help.
