Problem with XPath query? Processing documents from web
Hi there,
I am trying to extract documents from a movie review site. When I run the process below I get 0 results but can't figure out the problem, can anyone help? Thanks.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="313" y="238">
<parameter key="number_of_iterations" value="10"/>
<process expanded="true">
<operator activated="true" class="web:process_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Process Documents from Web" width="90" x="179" y="85">
<parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/"/>
<list key="crawling_rules"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="seg" value="//h:table[@class='table table-striped']/h:tr"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="text" value="//h:p/text|)"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Web" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Loop" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Best Answer
kayman Member Posts: 662 Unicorn
It's a bit hard to explain, but what you are doing is as follows:
You select the reviews with loop logic. The XPath you use translates to something like 'give me the text of every element that has the class the_review'.
But then you take the first match of the XPath for each of the other attributes. This doesn't give the right result, because every review has this data at a different location relative to the actual review text, so the values are not mapped together. With the current structure you lose all relation between the data, and what you need is more like:
For every div containing a review, get me the parameters (reviewer, date, etc.) that are part of this div.
(Told you it was hard to explain.)
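To make the difference concrete, here is a minimal Python sketch of the two approaches. It is not RapidMiner code, and the markup is a hypothetical, simplified stand-in for the review page (the `score` class and all text values are made up for illustration). It only uses the limited XPath subset of the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second review has no score.
PAGE = """
<html><body>
  <div class="row review_table_row">
    <div class="critic_name"><a>Critic A</a></div>
    <div class="the_review">Review one.</div>
    <div class="score">Original Score: 3/4</div>
  </div>
  <div class="row review_table_row">
    <div class="critic_name"><a>Critic B</a></div>
    <div class="the_review">Review two.</div>
  </div>
  <div class="row review_table_row">
    <div class="critic_name"><a>Critic C</a></div>
    <div class="the_review">Review three.</div>
    <div class="score">Original Score: B+</div>
  </div>
</body></html>
"""


def text_of(node, path):
    """Return the stripped text of the first match of `path` below `node`, or ''."""
    hit = node.find(path)
    return (hit.text or "").strip() if hit is not None else ""


def extract_rows(page):
    """Per-row extraction: every field is looked up relative to its own review div."""
    root = ET.fromstring(page)
    rows = []
    for row in root.findall(".//div[@class='row review_table_row']"):
        rows.append({
            "critic": text_of(row, "./div[@class='critic_name']/a"),
            "review": text_of(row, "./div[@class='the_review']"),
            "score": text_of(row, "./div[@class='score']"),
        })
    return rows


def extract_columns(page):
    """Independent page-wide queries: each list loses its link to the others."""
    root = ET.fromstring(page)
    return {
        "review": [d.text.strip() for d in root.findall(".//div[@class='the_review']")],
        "score": [d.text.strip() for d in root.findall(".//div[@class='score']")],
    }


if __name__ == "__main__":
    for row in extract_rows(PAGE):
        print(row)
    print(extract_columns(PAGE))
```

With the page-wide queries you get three reviews but only two scores, and nothing tells you which review is missing its score. The per-row version keeps an empty score next to the review it belongs to.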
Long story short, I'm not sure you can get this with the typical XPath extractor, but you can use XPath directly with the XSLT operators.
I've attached an example. It's a bit more complex, but still relatively easy to adapt.
The logic is to create proper XML from the HTML first (the page code is not XHTML) and then use dedicated XPath. This returns a nice table with your data, ready to use:
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="179" y="34">
<parameter key="number_of_iterations" value="10"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="289">
<parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:html_to_xml" compatibility="8.1.000" expanded="true" height="68" name="HTML to XML" width="90" x="179" y="289"/>
<operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="313" y="289">
<list key="replace_dictionary">
<parameter key="(?s)^.*?&lt;html.*?&gt;" value="&lt;html&gt;"/>
</list>
</operator>
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="646">
<parameter key="text" value="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt; &lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;&gt; &lt;xsl:output method=&quot;xml&quot; version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt; &lt;xsl:template match=&quot;/&quot;&gt; &lt;root&gt; &lt;xsl:for-each select=&quot;//div[@class='row review_table_row']&quot;&gt; &lt;!--&lt;xsl:copy-of select=&quot;.&quot;/&gt;--&gt; &lt;row critic=&quot;{normalize-space(.//div[contains(@class,'critic_name')]/a[1])}&quot; publisher=&quot;{normalize-space(.//div[contains(@class,'critic_name')]/a[2]/em)}&quot; date=&quot;{normalize-space(.//div[contains(@class,'review_date')])}&quot; review=&quot;{normalize-space(.//div[@class='the_review'])}&quot; score=&quot;{normalize-space(.//div[@class='small subtle'][contains(.,'Original Score')])}&quot;/&gt; &lt;/xsl:for-each&gt; &lt;/root&gt; &lt;/xsl:template&gt; &lt;/xsl:stylesheet&gt;"/>
</operator>
<operator activated="true" class="text:process_xslt" compatibility="8.1.000" expanded="true" height="82" name="Process XSLT" width="90" x="313" y="544"/>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="544">
<parameter key="query_type" value="Regular Region"/>
<list key="string_machting_queries">
<parameter key="review" value="&lt;row./&gt;"/>
</list>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="review" value="&lt;row./&gt;"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Critic Name" value="//@critic"/>
<parameter key="Reviews" value="//@review"/>
<parameter key="Date Posted" value="//@date"/>
<parameter key="Publisher" value="//@publisher"/>
<parameter key="Score" value="//@score"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<parameter key="assume_html" value="false"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="segment" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="544">
<parameter key="text_attribute" value="content"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="391">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="content|query_key"/>
<parameter key="invert_selection" value="true"/>
</operator>
<connect from_op="Get Page" from_port="output" to_op="HTML to XML" to_port="document"/>
<connect from_op="HTML to XML" from_port="document" to_op="Replace Tokens" to_port="document"/>
<connect from_op="Replace Tokens" from_port="document" to_op="Process XSLT" to_port="document"/>
<connect from_op="Create Document" from_port="output" to_op="Process XSLT" to_port="xslt document"/>
<connect from_op="Process XSLT" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="9.0.003" expanded="true" height="68" name="Store" width="90" x="313" y="34">
<parameter key="repository_entry" value="New Output of Web Pages/RT Reviews"/>
</operator>
<connect from_op="Loop" from_port="output 1" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
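One step in the process above that is easy to overlook is the Replace Tokens operator, which swaps everything up to and including the opening html tag for a bare `<html>`. A likely reason (this is my reading, not something stated in the post) is that it drops the doctype and any default namespace, so un-prefixed paths such as `//div` in the XSLT can match. The Python sketch below shows that effect with the same regular expression. The input string is a hypothetical example, and Python's `re` and `xml.etree.ElementTree` stand in for the RapidMiner operators.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XHTML-style input with a doctype and a default namespace.
XHTML = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en">'
    '<body><div class="the_review">Nice film.</div></body></html>'
)


def strip_html_prolog(doc):
    """Replace everything up to and including the opening <html ...> tag with a bare <html>."""
    return re.sub(r"(?s)^.*?<html.*?>", "<html>", doc, count=1)


def count_divs(doc):
    """Count div elements using an un-prefixed path, like the paths in the XSLT."""
    return len(ET.fromstring(doc).findall(".//div"))


if __name__ == "__main__":
    print(count_divs(XHTML))                     # namespaced document
    print(count_divs(strip_html_prolog(XHTML)))  # after the replacement
```

In the namespaced document the element is really `{http://www.w3.org/1999/xhtml}div`, so the plain `div` path finds nothing. After the replacement the same path matches.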
Answers
The reviews are not in a table but in a div. Your query looks for a table that does not exist (table-striped cannot be found in the page source).
This is how a review is stored, in a div with the class 'the_review':
<div class="the_review">
This is a lovely, funny, wonderfully acted film. The big problem is, it's an 80-minute movie that takes two hours. By the time you get to the real story, you're out of gas.
</div>
So try an XPath that selects that div instead of the table.
It's untested, so don't take it for granted :-)
What could have happened is that you tested the site during an A/B test, or that the page code differs depending on the user agent used by RapidMiner.
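As a quick check of the point about the missing table, here is a small Python sketch. It is only an illustration, not RapidMiner: the markup is a hypothetical, well-formed stand-in for the review div, and RapidMiner's `h:` prefix is dropped because this sample has no namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the review markup quoted above.
SNIPPET = """
<body>
  <div class="the_review">
    A hypothetical review text.
  </div>
</body>
"""


def matched_text(markup, path):
    """Return the stripped text of every element that `path` selects."""
    root = ET.fromstring(markup)
    return [(node.text or "").strip() for node in root.findall(path)]


# The original query looks for table rows; this markup has no table at all.
table_hits = matched_text(SNIPPET, ".//table[@class='table table-striped']/tr")

# Selecting the div by its class finds the review.
div_hits = matched_text(SNIPPET, ".//div[@class='the_review']")

if __name__ == "__main__":
    print(table_hits)
    print(div_hits)
```

A query that matches nothing is not an error, it just returns an empty result, which would explain a process that runs fine but produces 0 results.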
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="179" y="34">
<parameter key="number_of_iterations" value="10"/>
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
<parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Review" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div[1]/text() "/>
<parameter key="Date Posted" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[1]/text()"/>
<parameter key="Publisher" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[1]/h:div[3]/h:a[2]/h:em/text()"/>
<parameter key="Score" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div[2]/text"/>
<parameter key="Critic Name" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[1]/h:div[3]/h:a[1]/text() "/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Feedback_text" value="//h:div[@class='the_review']/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="9.0.003" expanded="true" height="68" name="Store" width="90" x="1251" y="85">
<parameter key="repository_entry" value="New Output of Web Pages/RT Reviews"/>
</operator>
<connect from_op="Loop" from_port="output 1" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Now, if you are pretty familiar with XPath and XSLT, I'd suggest using the Process XSLT operator instead. Just insert your XSLT (v1.0) in a document and convert your page any way you like as a pro...