Extract Information Problem

dezok Member Posts: 3 Contributor I
edited November 2018 in Help
Hi, I'm quite new to this, but I'm stuck on how to extract part of the HTML code from a saved web page with XPath. The code on the page looks like the snippet below, and I would like to extract the number.

<div>
<span>Age:</span>
22
</div>
I tried the XPath below in Google Docs and it works exactly how I want it to.
//*[contains(., 'Age:')]/../span/text()
But I just can't figure out what I am missing in RapidMiner. I tried adding the HTML namespace as well as adding predicates. The RapidMiner XPath looked like:
//h:*[contains(., 'Age:')]/../span/text()[2]
And it gives empty fields as results. Does anyone have suggestions?

Answers

  • colo Member Posts: 236 Maven
    Hi dezok,

    it's hard to believe that you got the number extracted with the query you posted for your example. Since you request the text content of the span element, this should return "Age:".
    But if you leave out the span element and address the second text element (the first one seems to be only whitespace), then it should work:
    //h:*[contains(., 'Age:')]/../text()[2]
    Regards
    Matthias
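The point about the whitespace-only first text node can be checked outside RapidMiner. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` and the well-formed snippet from the question. ElementTree has no `text()` node test, so the two text nodes of the `div` are read through `div.text` (the text before the first child) and `span.tail` (the text after the `span`); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, including its line breaks.
html = "<div>\n<span>Age:</span>\n22\n</div>"

div = ET.fromstring(html)
span = div.find("span")

# Text between <div> and <span>: what XPath addresses as text()[1].
first_text = div.text
# Text between </span> and </div>: what XPath addresses as text()[2].
second_text = span.tail

print(repr(first_text))     # whitespace only
print(repr(second_text))    # contains the number, surrounded by line breaks
print(second_text.strip())  # the value without surrounding whitespace
```

Because the line break before the `span` counts as a text node of its own, the number ends up in the second text node, which is why the query needs the `[2]` predicate.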
  • dezok Member Posts: 3 Contributor I
    Hi colo, my mistake, of course I used
    //h:*[contains(., 'Age:')]/../text()
    without the span element (out of desperation I tried span and forgot to remove it). I used the query addressing the second text element and still got no results.
    I did use a different query (posted below along with a bigger HTML sample) to get the age, and it worked when addressing the second text element. But it creates another problem: I have more than the age element to scrape, and they are not always in the same order (sometimes an element is missing), so I get weird results like "Age: female".
    HTML code:

    <div class="userAttributes">
    <div>
    <span>Age:</span>
    22
    </div>
    <div>
    <span>Sex:</span>
    female
    </div>
    </div>
    XPath that scraped Age in RapidMiner:

    //h:div[@class='userAttributes']/h:div[1]/text()[2]
    XPath that scraped Sex:

    //h:div[@class='userAttributes']/h:div[2]/text()[2]
    But as I said, when there is no age in the HTML I get results like "Age: female".
  • colo Member Posts: 236 Maven
    Hi dezok,

    Using fixed position predicates of course causes trouble if contents are missing or appear in a different order across documents. You have to match on the text contents to avoid this. I tried to use the following::text() axis to get the text content after addressing one of the span nodes, but this always returned the text of the next node. I am not sure whether the axis works for text(), and I don't have the time to investigate further. So I went one step back and selected the text as in the initial example.

    This example process works fine for me:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="145" width="279">
          <operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="&lt;div class=&quot;userAttributes&quot;&gt;&#10;&lt;div&gt;&#10;&lt;span&gt;Age:&lt;/span&gt;&#10;22&#10;&lt;/div&gt;&#10;&lt;div&gt;&#10;&lt;span&gt;Sex:&lt;/span&gt;&#10;female&#10;&lt;/div&gt;&#10;&lt;/div&gt;"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="age" value="//h:div[@class=&quot;userAttributes&quot;]//h:span[contains(./text(), &quot;Age&quot;)]/../text()[2]"/>
              <parameter key="sex" value="//h:div[@class=&quot;userAttributes&quot;]//h:span[contains(./text(), &quot;Sex&quot;)]/../text()[2]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Regards
    Matthias
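The idea behind the queries above, selecting each value by its label text rather than by the position of its `div`, can be sketched outside RapidMiner as well. This is a rough illustration only, assuming Python's standard `xml.etree.ElementTree` and a well-formed snippet (real web pages usually need a proper HTML parser). The helper name `extract_attribute` and the sample strings are hypothetical and not part of RapidMiner.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_attribute(html: str, label: str) -> Optional[str]:
    """Return the text that follows the <span> containing `label`, or None."""
    root = ET.fromstring(html)
    # Look only inside the container div, like @class="userAttributes".
    for container in root.iter("div"):
        if container.get("class") != "userAttributes":
            continue
        for span in container.iter("span"):
            # Match on the label text, like contains(./text(), "Age").
            if label in (span.text or ""):
                # The text right after </span> holds the value.
                value = (span.tail or "").strip()
                return value or None
    return None


# Hypothetical sample documents: one complete, one without the Age entry.
full = (
    '<div class="userAttributes">\n'
    "<div>\n<span>Age:</span>\n22\n</div>\n"
    "<div>\n<span>Sex:</span>\nfemale\n</div>\n"
    "</div>"
)
no_age = (
    '<div class="userAttributes">\n'
    "<div>\n<span>Sex:</span>\nfemale\n</div>\n"
    "</div>"
)

print(extract_attribute(full, "Age"))
print(extract_attribute(full, "Sex"))
print(extract_attribute(no_age, "Age"))
print(extract_attribute(no_age, "Sex"))
```

With label-based matching, a document that lacks the Age entry yields no age value at all instead of picking up "female" from whichever `div` happens to come first.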
  • dezok Member Posts: 3 Contributor I
    Hi again colo,
    that works great :) You're a lifesaver. Thanks for the help.