
Enrich data from Web Service - Xpath Access

kludikovsky Member Posts: 30 Maven
edited November 2018 in Help

Simple question:

What's wrong with this XPath?

Now a little more information:

I am trying to add information to data I already have, so the 'Enrich Data by Webservice' operator seemed the proper tool.

But I can't get the data I am looking for.

As far as I can tell, the XPath does not work as expected. (This might have to do with my understanding of XPath ;-) )

Therefore I created a test process, which is attached below.

It contains 4 slightly different groups of test cases:

  • test_1..3
  • test_4..6
  • head_1..4
  • html

My question:

Why do only some cases deliver data and others not, especially those that address elements directly?

 

Any solutions or hints are welcome.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset_from_doc" compatibility="0.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
<parameter key="Column Separator" value=","/>
<parameter key="Input Csv" value="a&#10;1"/>
</operator>
<operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="380" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Nominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="test_1" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]//*[@class=&quot;address&quot;]"/>
<parameter key="test_2" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/div"/>
<parameter key="test_3" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/div[1]"/>
<parameter key="test_4" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]//*[@itemprop=&quot;url&quot;]"/>
<parameter key="test_5" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/a"/>
<parameter key="test_6" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/a[1]"/>
<parameter key="head_1" value="//html"/>
<parameter key="head_2" value="//head"/>
<parameter key="head_3" value="//*/head"/>
<parameter key="head_4" value="//*"/>
<parameter key="html" value="html"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="true"/>
<parameter key="assume_html" value="true"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<parameter key="request_method" value="GET"/>
<parameter key="url" value="http://www.firmenabc.at/result.aspx?what=haniger+benesch+versicherungs+makler+gmbh&amp;where=&amp;exact=false&amp;inTitleOnly=false&amp;l=&amp;si=0&amp;iid=&amp;sid=-1&amp;did=&amp;cc="/>
<parameter key="delay" value="500"/>
<list key="request_properties"/>
<parameter key="encoding" value="UTF-8"/>
<description align="center" color="transparent" colored="false" width="126">Get the data from FirmenABC</description>
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
<connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Best Answer

  • kludikovsky Member Posts: 30 Maven
    Solution Accepted

    After several days of experimentation and searching the web:

     

    There are two reasons why this does not work properly:

     

    1) HTTP 301: page moved

    RM does not handle moved pages. If you request a page that responds with HTTP 301, a browser will forward you to the new location, but RM will not.

    Found that thanks to @sgenzer here http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Open-File-not-returning-data-from-url/m-p/41351#M28008
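
    One way to check for point 1 outside RapidMiner is to request the URL with a client that does not follow redirects and to look at the status code and the Location header. The sketch below is only an illustration under assumptions: check_redirect is a made-up helper name, it handles plain http:// URLs only, and it uses Python's standard http.client rather than anything RM does internally.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def check_redirect(url: str) -> Tuple[int, Optional[str]]:
    """Return (status code, Location header) without following redirects.

    Hypothetical helper for illustration; supports http:// URLs only.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=5)
    try:
        # Rebuild the request target from path and query string.
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can be closed cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

    If the status is 301, the Location value (which may be relative to the original host) is the address a browser would move on to, so using it as the operator's url parameter may be worth a try.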

     

     

    2) 'h:' namespace prefix required

    All HTML tags need to be prefixed with the 'h:' namespace prefix. Although the HTML namespace is set by default and need not be specified in the namespace definition, the prefix still needs to be written in the XPath queries. For example, //head has to be written as //h:head.

    (It might be an improvement for this operator to have the 'h:' namespace as a preset, so that XPaths copied from browsers can be used without any modification.)

    Found this thanks to a small note here
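
    The effect described in point 2 can be reproduced outside RapidMiner. The sketch below uses Python's standard xml.etree.ElementTree on a small invented XHTML snippet (not the real FirmenABC markup) and assumes that the 'h' prefix is bound to the standard XHTML namespace URI; how RM binds it internally is not confirmed here. It shows that wildcard steps such as * match namespaced elements, while named steps only match once they carry the prefix.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a page that was parsed into namespaced XHTML.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Demo</title></head>
  <body>
    <div id="main-container">
      <div class="result-content">
        <a itemprop="url" href="/example">Example Ltd</a>
        <div class="address">Some Street 1</div>
      </div>
    </div>
  </body>
</html>"""

# Assumption: 'h' maps to the standard XHTML namespace URI.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Named step without prefix: the element's real name is namespaced, so no match.
unprefixed = root.findall(".//head")

# Same step with the prefix: matches the head element.
prefixed = root.findall(".//h:head", NS)

# Wildcard steps ignore the element name, so they match without any prefix.
wildcard_attr = root.findall(".//*[@id='main-container']")

# A wildcard path ending in a named step behaves like the named step.
child_div_unprefixed = root.findall(".//*[@class='result-content']/div")
child_div_prefixed = root.findall(".//*[@class='result-content']/h:div", NS)

print(len(unprefixed), len(prefixed), len(wildcard_attr))
print(len(child_div_unprefixed), len(child_div_prefixed))
```

    This mirrors the pattern in the test process: queries built only from * steps can return data, while queries ending in /div or /a stay empty until they are written as /h:div or /h:a.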

Answers

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @kludikovsky - the first thing I do when trying to decode XML or JSON is look at the raw response myself to see what's going on.  I do this by setting the parameter "query type" to "Regular Expression" and adding a single attribute called "foo" with a RegEx of ".*".  Then I paste the result into a nice XML or JSON viewer (codebeautify.org is a decent one), and only afterwards start building my XML or JSON paths in the operator.

     

    So back to your question...my first piece of advice is to make sure you're getting a true XPath.  From what I can see, you're getting the whole HTML from some website instead of an XML data array.  So is this what you want to do?  Or rather do you want to extract the content from the site?  If the latter, you can use "Extract Content" from the Web Mining extension or many other tools (e.g. Unescape HTML) and go from there.

     

    For additional help on XML/JSON path expressions, my go-to site is http://goessner.net/articles/JsonPath/

     

    Scott

  • kludikovsky Member Posts: 30 Maven

    hi @sgenzer,

     

    I would not have used XPath if I hadn't analysed the document beforehand.

    I took several approaches - including the Xpath from the browser - and modified them.

    The examples should point to the same location, as long as my understanding of the XPath syntax is correct.

     

    And there is another question:

    With Extract Content, how would you extract attributes from an element, in this case e.g. a title="xxx"?

    I'd like to get specific elements and not text blocks.

     

    Kurt

     

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @kludikovsky - there are many ways to extract content from a block of HTML in RapidMiner.  Extract Content is useful if you are trying to pull out the text content of a web page rather than something tagged like an href.  In this situation, my personal preference is to parse the code using various string operations in the Generate Attributes operator.  To get the title of that web page, I would do something like this:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
    <parameter key="url" value="http://www.firmenabc.at/result.aspx?what=haniger+benesch+versicherungs+makler+gmbh&amp;where=&amp;exact=false&amp;inTitleOnly=false&amp;l=&amp;si=0&amp;iid=&amp;sid=-1&amp;did=&amp;cc="/>
    <list key="query_parameters"/>
    <list key="request_properties"/>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="false"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="suf" value="suffix(text,length(text)-index(text,&quot;link title=&quot;)-12)"/>
    <parameter key="repl" value="replaceAll(suf,&quot;\&quot;&quot;,&quot;&quot;)"/>
    <parameter key="title" value="prefix(repl,index(repl,&quot; &quot;))"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="title"/>
    </operator>
    <connect from_op="Get Page" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    There are probably a half dozen other operators to do this content extraction - from the Text Processing extension, the Web Mining extension, or just using core components like I did above.   Take your pick.
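
    For readers who want to follow the three Generate Attributes expressions (suf, repl, title) step by step, here is a rough Python rendering of the same string logic. It is a sketch, not part of the process: extract_link_title is a made-up name, the sample HTML is invented, and unlike RapidMiner's index() the Python version raises an error when the marker is missing.

```python
def extract_link_title(text: str) -> str:
    """Mimic the suf / repl / title expressions from Generate Attributes."""
    # 'link title=' is 11 characters; with the opening quote that makes 12,
    # which is where the "-12" in the suf expression comes from.
    marker = 'link title="'
    # suf: everything after the first occurrence of the marker
    suf = text[text.index(marker) + len(marker):]
    # repl: remove every double quote
    repl = suf.replace('"', "")
    # title: everything before the first space
    return repl[: repl.index(" ")]


# Invented sample input, not the real page source.
sample = '<head><link title="FirmenABC" rel="search" href="/opensearch.xml"/></head>'
title = extract_link_title(sample)
print(title)
```

    Read this way, the expressions cut the value at the first space, so a title containing several words would come back as its first word only.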

     

    Scott

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    I am reopening this post because I've been trying to use XPaths without success. I have simple XPath expressions that work with Scrapy, like '//h2/a/text()', but they don't work in operators like Cut Documents or Extract Information.

     

    I can find the expressions relatively easily with the Scrapy shell and my web browser (inspect element...). But if I can't use them in RM, they are useless. Is there any kind of guide on the matter?
