Enrich data from Web Service - Xpath Access
Simple question:
What's wrong with this XPath?
Now a little more information:
I am trying to add information to already available data, so the 'Enrich Data by Webservice' operator seemed the proper tool.
But I can't get the data I am looking for.
As far as I have found out, the XPath does not work as expected. (This might have to do with my understanding of XPath ;-) )
Therefore I created a test, which is attached below.
This contains 4 slightly different test cases:
- test_1..3
- test_4..6
- head_1..4
- html
My question:
Why do only some cases deliver data and others not, especially those where elements are addressed directly?
Any solutions or hints are welcome.
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset_from_doc" compatibility="0.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
<parameter key="Column Separator" value=","/>
<parameter key="Input Csv" value="a 1"/>
</operator>
<operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="380" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Nominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="test_1" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]//*[@class=&quot;address&quot;]"/>
<parameter key="test_2" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/div"/>
<parameter key="test_3" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/div[1]"/>
<parameter key="test_4" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]//*[@itemprop=&quot;url&quot;]"/>
<parameter key="test_5" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/a"/>
<parameter key="test_6" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/a[1]"/>
<parameter key="head_1" value="//html"/>
<parameter key="head_2" value="//head"/>
<parameter key="head_3" value="//*/head"/>
<parameter key="head_4" value="//*"/>
<parameter key="html" value="html"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="true"/>
<parameter key="assume_html" value="true"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<parameter key="request_method" value="GET"/>
<parameter key="url" value="http://www.firmenabc.at/result.aspx?what=haniger+benesch+versicherungs+makler+gmbh&amp;where=&amp;exact=false&amp;inTitleOnly=false&amp;l=&amp;si=0&amp;iid=&amp;sid=-1&amp;did=&amp;cc="/>
<parameter key="delay" value="500"/>
<list key="request_properties"/>
<parameter key="encoding" value="UTF-8"/>
<description align="center" color="transparent" colored="false" width="126">Get the data from FirmenABC</description>
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
<connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Best Answer
kludikovsky Member Posts: 30 Maven
After several days of experimentation and searching the web:
There are two reasons why this does not work properly:
1) HTTP 301 (page moved)
RM does not follow redirects. If the page you request responds with HTTP 301, a browser forwards you to the new location, but RM does not.
I found that thanks to @sgenzer here: http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Open-File-not-returning-data-from-url/m-p/41351#M28008
2) 'h:' namespace prefix required
All HTML tags need to be prefixed with the 'h:' namespace prefix. Even though the HTML namespace is set by default and does not need to be declared in the namespace definition, the prefix still has to be written in the XPath queries.
(It might be an improvement idea for this operator to have the 'h:' namespace as a preset, so that XPaths copied from browsers can be used without any modifications.)
Found this thanks to a small note here
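Applied to the test process from the question, the second point might look like the fragment below. This is only a sketch, not a verified export: it assumes that `assume_html` stays `true`, that the `h` prefix is predefined as described above (so the `namespaces` list can stay empty), and that the attributes `id`, `class` and `itemprop` need no prefix, since unprefixed attributes are not in any namespace. Only the named element steps change, because a wildcard `*` step already matches elements in any namespace.

```xml
<list key="xpath_queries">
  <!-- wildcard-only query: no element names, so nothing to prefix -->
  <parameter key="test_1" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]//*[@class=&quot;address&quot;]"/>
  <!-- named element steps get the h: prefix -->
  <parameter key="test_2" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/h:div"/>
  <parameter key="test_5" value="//*[@id=&quot;main-container&quot;]//*[@class=&quot;result-content&quot;]/h:a"/>
  <parameter key="head_1" value="//h:html"/>
  <parameter key="head_2" value="//h:head"/>
</list>
```

The same rewrite would apply to an XPath copied from a browser: every element name gets `h:` in front, while attribute tests such as `[@class=...]` stay as they are.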
Answers
Hi @kludikovsky - the first thing I do when trying to decode XML or JSON is look at the array myself and see what's going on. I do this by setting the parameter "query type" to "Regular Expression" and adding just one attribute called "foo" with a RegEx of ".*". Then I paste the result into a nice XML or JSON viewer (codebeautify.org is a decent one), and only afterwards start building my XML or JSON paths in the operator.
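In the process XML, that inspection step might look roughly like the fragment below. It is a sketch under assumptions: the parameter keys `query_type` and `regular_expression_queries` are taken from the process posted in the question, but the exact value string "Regular Expression" is inferred from the parameter name used above and has not been checked against an exported process. "foo" is only a placeholder attribute name.

```xml
<!-- inside the "Enrich Data by Webservice" operator -->
<parameter key="query_type" value="Regular Expression"/>
<list key="regular_expression_queries">
  <!-- one catch-all query so the raw response ends up in a single attribute -->
  <parameter key="foo" value=".*"/>
</list>
```

If only the first line of the response comes back, `(?s).*` is the Java regex form that lets `.` span line breaks. Whether the operator needs it is untested here.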
So back to your question...my first piece of advice is to make sure you're getting a true XPath. From what I can see, you're getting the whole HTML from some website instead of an XML data array. So is this what you want to do? Or rather do you want to extract the content from the site? If the latter, you can use "Extract Content" from the Web Mining extension or many other tools (e.g. Unescape HTML) and go from there.
For additional help on XML/JSON path expressions, my go-to site is http://goessner.net/articles/JsonPath/
Scott
hi @sgenzer,
I would not have used XPath if I hadn't analysed the document beforehand.
I tried several approaches, including the XPath from the browser, and modified them.
The examples should point to the same location, as long as my understanding of the XPath syntax is correct.
And there is another question:
With Extract Content, how would you extract attributes from an element, in this case e.g. a title="xxx"?
I'd like to get specific elements and not text blocks.
Kurt
hi @kludikovsky - there are many ways to extract content from a block of HTML code in RapidMiner. Extract Content is useful if you are trying to pull out the content of a web page rather than something tagged, like an href. In this situation, my personal preference is to parse the code using various string operations in the Generate Attributes operator. To get the title of that web page, I would do something like this:
There are probably a half dozen other operators to do this content extraction - from the Text Processing extension, the Web Mining extension, or just using core components like I did above. Take your pick.
Scott
I am reopening this post because I've been trying to use XPaths without success. I have simple XPath expressions that work with Scrapy, like '//h2/a/text()', but they don't work in operators like Cut Documents or Extract Information.
I can find the expressions relatively easily with the Scrapy shell and my web browser (inspect element...). But if I can't use them in RM, they are useless. Is there any kind of guide on the matter?