basic xpath problem

Kintaro · March 2015

Hello,

I'm trying to extract data with XPath from an HTML page.

I have:
Create Document => Extract Information

Create Document:


<html>
<head>
<title>TITLE</title>
</head>
<body>BODY</body>
</html>

Extract Information configured with:
query type: xpath
attribute type: nominal
xpath queries: //title
namespace: n/a
ignore CDATA: true
assume html: true

Result:
attribute name: ?

What am I doing wrong? >:(

Kintaro · March 2015

I'm asking this because if I try the same thing in an online XPath tester it works without any problem, so I don't know why RapidMiner doesn't.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="6.1.000" expanded="true" height="60" name="Create Document" width="90" x="112" y="255">
        <parameter key="text" value="&lt;html&gt;&#10;&lt;head&gt;&#10;&lt;title&gt;TITLE&lt;/title&gt;&#10;&lt;/head&gt;&#10;&lt;body&gt;BODY&lt;/body&gt;&#10;&lt;/html&gt;"/>
      </operator>
      <operator activated="true" class="text:extract_information" compatibility="6.1.000" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries">
          <parameter key="nome" value="&lt;title&gt;.&lt;/title&gt;"/>
        </list>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="nome" value="//title"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
      <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Kintaro · March 2015

Solved

I can't use a path like //title; I have to use, for example:

//h:title/text()

text() extracts only the text from the title tag,

and I have to use the h: prefix because the document is HTML, right?
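
For anyone hitting the same thing, here is a minimal sketch of why the prefix matters. It assumes (this is not confirmed in the thread) that with "assume html" enabled the document is handled as XHTML, so every element sits in the XHTML namespace and a bare //title matches nothing. The sketch uses Python's standard xml.etree.ElementTree rather than RapidMiner itself, and the xmlns attribute on the html element is added by hand to imitate that assumed conversion. ElementTree has no text() step, so the .text attribute stands in for it.

```python
import xml.etree.ElementTree as ET

# XHTML namespace URI; the prefix "h" is just a local label bound to it.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same page as in the question, with the XHTML namespace declared by hand
# (assumption: this imitates what an HTML-to-XHTML conversion would produce).
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>TITLE</title>
</head>
<body>BODY</body>
</html>"""

root = ET.fromstring(DOC)


def title_without_prefix(node):
    """Query with no namespace prefix: looks for 'title' in no namespace."""
    return node.find(".//title")


def title_with_prefix(node):
    """Query with the 'h' prefix bound to the XHTML namespace."""
    match = node.find(".//h:title", {"h": XHTML_NS})
    # .text plays the role of text() here: only the text content is returned.
    return None if match is None else match.text


print(title_without_prefix(root))  # no match, because the element is namespaced
print(title_with_prefix(root))     # the text content of the title element
```

Online testers usually parse the page without any namespace, which would explain why //title works there but not in the process above.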
