Problems with XPath queries
I'm trying to crawl the Dell website and extract information about their laptops.
I have two problems:
1.
I can't extract the processor information in RapidMiner.
This is an example laptop page I need to extract from: http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop
I'm trying to get the first Processor entry (7th Generation AMD A9-9400 Processor with Radeon™ R5 Graphics).
I worked out an XPath query that extracts it in Google Chrome, but I can't get it to work in RapidMiner.
In the Chrome console I use: $x("string(//span[contains(.,'Processor')]/../../../../following-sibling::div/div/div/div/div/span)")
In RapidMiner I have tried this: string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)
and other variations, with no success in either RapidMiner 5 or RapidMiner 7.
Does anyone know what is wrong with my XPath query syntax for RapidMiner?
2.
The XPath queries:
normalize-space(//*[@id='sharedPdPageProductTitle']/text())
normalize-space(//*[@id='starting-price']/text())
both work in RapidMiner 5 but not in RapidMiner 7.
Is there a difference in XPath syntax between RapidMiner 5 and RapidMiner 7?
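To show why I added the h: prefix to the element names in the RapidMiner version of the Processor query: as far as I understand it, RapidMiner converts the page to XHTML before evaluating XPath, so the elements end up in the XHTML namespace and an unprefixed name like span no longer matches, while a wildcard step like * still does. Here is a minimal sketch of that behaviour using Python's standard-library ElementTree rather than RapidMiner itself. The sample markup, the product title text, and the binding of the prefix "h" are assumptions for illustration only; the real Dell page is nested much more deeply.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a product page after it has
# been converted to XHTML (not the real Dell markup).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><span>Processor</span></div>
    <div><span>7th Generation AMD A9-9400 Processor</span></div>
    <h1 id="sharedPdPageProductTitle">Inspiron 15 5565</h1>
  </body>
</html>"""

# Assumed prefix binding: "h" mapped to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Unprefixed element names only match elements in no namespace,
# so this finds nothing in a namespaced document.
unprefixed = root.findall(".//span")

# Prefixed names resolve to the XHTML namespace and match both spans.
prefixed = root.findall(".//h:span", NS)

# A wildcard step matches elements in any namespace, which would explain
# why the id-based queries need no prefix at all.
by_id = root.findall(".//*[@id='sharedPdPageProductTitle']")

print(len(unprefixed), len(prefixed), len(by_id))
```

In this sketch the unprefixed query finds no elements, the prefixed one finds both spans, and the wildcard query finds the title element. ElementTree only supports a subset of XPath (no string() or contains(), for example), so this illustrates the namespace handling only, not my full query.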
Here are my processes in XML form:
RapidMiner 5:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="75">
<parameter key="url" value="http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*/?productdetails/.*"/>
<parameter key="store_with_matching_url" value=".*/?productdetails/.*"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="max_pages" value="1000"/>
<parameter key="domain" value="server"/>
<parameter key="delay" value="2000"/>
<parameter key="max_threads" value="4"/>
<parameter key="max_page_size" value="10000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="30">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="514" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Name" value="normalize-space(//*[@id='sharedPdPageProductTitle']/text())"/>
<parameter key="Unit Purchase Price" value="normalize-space(//*[@id='starting-price']/text())"/>
<parameter key="Processor" value="string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
RapidMiner 7:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="85">
<parameter key="url" value="http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*/?productdetails/.*"/>
<parameter key="store_with_matching_url" value=".*/?productdetails/.*"/>
</list>
<parameter key="max_crawl_depth" value="2"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="max_pages" value="1000"/>
<parameter key="max_page_size" value="10000"/>
<parameter key="delay" value="2000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="85">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Name" value="normalize-space(//*[@id='sharedPdPageProductTitle']/text())"/>
<parameter key="Unit Purchase Price" value="normalize-space(//*[@id='starting-price']/text())"/>
<parameter key="Processor" value="string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
I have had a similar problem: older XPath queries I created stopped working. I assumed something on the web page had changed and didn't bother to track it down, but based on this post, I wonder whether it was instead caused by a change in RapidMiner's XPath implementation. Hopefully one of the developers can provide some insight on this topic.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@Thomas_Ott any chance we could ask one of the developers to take a look at this? Thanks.
Sure. I pinged 'em.
Any response about this issue yet? Thank you for reaching out to a developer about this!
Hi Trevor,
Sorry for the delay. I just got confirmation that no changes have been made to the XPath implementation.
Nevertheless, I would like to investigate this issue thoroughly and try to find a solution.
Best,
Edin