[SOLVED] xpath

amypu · December 2013

Below is an example XML.

<p>
Thisisgood
</p>
<p>
Thisisbad
</p>
<p>
This
<br>
is
<br>
acceptable
</p>
<p>
Thisisfine
</p>

I want the result:
Thisisgood
Thisisbad
Thisisacceptable
Thisisfine

I use the XPath //p/text() in Google Docs (=importXML). Ultimately, I will use //h:p/text() in RapidMiner (with the Extract Information operator). This results in:
Thisisgood
Thisisbad
This is acceptable (appearing in different cells)
Thisisfine

What XPath would give me the result I need? Thank you.
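
For reference, the splitting can be reproduced outside Google Docs and RapidMiner. The sketch below is only an illustration using Python's standard xml.etree.ElementTree. It assumes the example consists of p elements with br line breaks, wrapped in a hypothetical root element and with br written as <br/> so that it parses as XML. It collects the direct text nodes of each p, which is what //p/text() selects, and shows that the third paragraph contributes three separate values.

```python
import xml.etree.ElementTree as ET

# Assumption: hypothetical <root> wrapper, <br/> self-closed for well-formedness.
SAMPLE = """<root>
<p>Thisisgood</p>
<p>Thisisbad</p>
<p>This<br/>is<br/>acceptable</p>
<p>Thisisfine</p>
</root>"""


def text_nodes(xml_source):
    """Return the direct text nodes of every <p>, like //p/text()."""
    root = ET.fromstring(xml_source)
    nodes = []
    for p in root.iter("p"):
        # A <p> with children stores its text in pieces: p.text comes
        # before the first child, each child's .tail comes after that child.
        pieces = [p.text] + [child.tail for child in p]
        nodes.extend(s.strip() for s in pieces if s and s.strip())
    return nodes


print(text_nodes(SAMPLE))
```

Each br splits the surrounding text into its own node, so a query that selects text() nodes cannot return the paragraph as one value.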

MariusHelf · December 2013

Well, what result do you need?

Best regards,
Marius

amypu · December 2013

I would like to have the following result:

Thisisgood
Thisisbad
Thisisacceptable
Thisisfine

I DO NOT want:

This is acceptable (appearing in different cells)

Thanks.

MariusHelf · December 2013

Hi,

This is the community forum; for guaranteed response times, please consider getting a support contract. During the holidays our main focus is not on free support.

However, let's focus on your issue: which versions of RapidMiner and the Text and Web extension are you using? I can't reproduce the behavior with text in different cells with Extract Information. In the latest versions, Extract Information delivers only the first result node; with //h:p/text() that would be "This" in the "This is acceptable" case. This is surely not what you want either. So in your case the approach would be to cut the document into its p tags and then extract the content of each p node with Extract Content. Optionally, you can then use Replace to remove the spaces.

Please see the process below for details.

Best regards,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
        <parameter key="text" value="&lt;p&gt; &#10;Thisisgood&#10;&lt;/p&gt;&#10;&lt;p&gt; &#10;Thisisbad&#10;&lt;/p&gt;&#10;&lt;p&gt; &#10;This&#10;&lt;br&gt;&#10;is&#10;&lt;br&gt;&#10;acceptable&#10;&lt;/p&gt;&#10;&lt;p&gt; &#10;Thisisfine&#10;&lt;/p&gt;&#10;"/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="p" value="//h:p"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
            <parameter key="minimum_text_block_length" value="1"/>
          </operator>
          <operator activated="false" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="313" y="120">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="result" value=" //h:p/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="segment" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="380" y="30">
        <parameter key="text_attribute" value="text"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
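
The idea behind the process above (handle each p separately, take its whole text content, then remove the whitespace) can also be sketched outside RapidMiner. The following is a minimal illustration using Python's standard library, not a description of what the operators do internally. The root wrapper, the <br/> form, and the function name paragraph_strings are assumptions made so that the sample parses as XML.

```python
import xml.etree.ElementTree as ET

# Assumption: same content as the Create Document text, wrapped in a
# hypothetical <root> element and with <br/> self-closed.
SAMPLE = """<root>
<p>
Thisisgood
</p>
<p>
Thisisbad
</p>
<p>
This
<br/>
is
<br/>
acceptable
</p>
<p>
Thisisfine
</p>
</root>"""


def paragraph_strings(xml_source):
    """Return one whitespace-free string per <p> element."""
    root = ET.fromstring(xml_source)
    results = []
    for p in root.iter("p"):
        # itertext() walks all text inside the element in document order,
        # so the pieces around each <br/> are joined back together.
        full_text = "".join(p.itertext())
        # split() with no argument drops spaces and line breaks alike.
        results.append("".join(full_text.split()))
    return results


print(paragraph_strings(SAMPLE))
```

In XPath terms, this corresponds to taking the string-value of each p element rather than its individual text() nodes. Note that stripping all whitespace would also join words in paragraphs that contain genuine spaces, so it only fits data like the example.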
