XPath returns empty values
OlliSchulz
Hello everyone,
I just started using RapidMiner, and so far it does everything I want it to do. However, I have run into a problem that I can't solve by myself.
I collected a lot of HTML files and want to extract certain data from them using XPath. I am using the "Process Documents from Files" operator combined with the "Extract Information" operator.
I want to extract data for the attributes "Datum", "Zeit", "Titel" and "Link". I receive correct values for three of the four attributes, but I don't receive any values for the attribute "Titel".
I tried different XPath expressions, but none of them works.
I hope you can help me with this small problem.
Please find my RapidMiner settings and the structure of the HTML file I want to extract data from below:
RapidMiner settings
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="374" width="434">
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="208">
<list key="text_directories">
<parameter key="63DU" value="C:\Users\Admin\Desktop\RapidMiner\63DU"/>
<parameter key="ADS" value="C:\Users\Admin\Desktop\RapidMiner\ADS"/>
<parameter key="ALV" value="C:\Users\Admin\Desktop\RapidMiner\ALV"/>
<parameter key="BAS" value="C:\Users\Admin\Desktop\RapidMiner\BAS"/>
<parameter key="BAY" value="C:\Users\Admin\Desktop\RapidMiner\BAY"/>
<parameter key="BEI" value="C:\Users\Admin\Desktop\RapidMiner\BEI"/>
<parameter key="BMW" value="C:\Users\Admin\Desktop\RapidMiner\BMW"/>
<parameter key="CBK" value="C:\Users\Admin\Desktop\RapidMiner\CBK"/>
<parameter key="DAI" value="C:\Users\Admin\Desktop\RapidMiner\DAI"/>
<parameter key="DBK" value="C:\Users\Admin\Desktop\RapidMiner\DBK"/>
<parameter key="DPW" value="C:\Users\Admin\Desktop\RapidMiner\DPW"/>
<parameter key="DTE" value="C:\Users\Admin\Desktop\RapidMiner\DTE"/>
<parameter key="EOAN" value="C:\Users\Admin\Desktop\RapidMiner\EOAN"/>
<parameter key="FME" value="C:\Users\Admin\Desktop\RapidMiner\FME"/>
<parameter key="FRE" value="C:\Users\Admin\Desktop\RapidMiner\FRE"/>
<parameter key="HEI" value="C:\Users\Admin\Desktop\RapidMiner\HEI"/>
<parameter key="HEN3" value="C:\Users\Admin\Desktop\RapidMiner\HEN3"/>
<parameter key="IFX" value="C:\Users\Admin\Desktop\RapidMiner\IFX"/>
<parameter key="LHA" value="C:\Users\Admin\Desktop\RapidMiner\LHA"/>
<parameter key="LIN" value="C:\Users\Admin\Desktop\RapidMiner\LIN"/>
<parameter key="MAN" value="C:\Users\Admin\Desktop\RapidMiner\MAN"/>
<parameter key="MEO" value="C:\Users\Admin\Desktop\RapidMiner\MEO"/>
<parameter key="MRK" value="C:\Users\Admin\Desktop\RapidMiner\MRK"/>
<parameter key="MUV2" value="C:\Users\Admin\Desktop\RapidMiner\MUV2"/>
<parameter key="RWE" value="C:\Users\Admin\Desktop\RapidMiner\RWE"/>
<parameter key="SAP" value="C:\Users\Admin\Desktop\RapidMiner\SAP"/>
<parameter key="SDF" value="C:\Users\Admin\Desktop\RapidMiner\SDF"/>
<parameter key="SIE" value="C:\Users\Admin\Desktop\RapidMiner\SIE"/>
<parameter key="TKA" value="C:\Users\Admin\Desktop\RapidMiner\TKA"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true" height="392" width="452">
<operator activated="true" class="text:extract_information" compatibility="5.1.002" expanded="true" height="60" name="Extract Information" width="90" x="112" y="165">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Datum" value="//h:td[@class='DATUM']/text()"/>
<parameter key="Zeit" value="//h:td[@class='ZEIT']/text()"/>
<parameter key="Titel" value="//h:td[@class='ARTIKEL_TITEL']/text()"/>
<parameter key="Link" value="//h:td[@class='ARTIKEL_TITEL']/h:a/@href"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Extract from html file
<table>
<colgroup>
<col class="DATUM" />
<col class="ZEIT" />
<col class="NEWS" />
</colgroup>
<thead>
<tr>
<th class="DATUM"> Datum </th>
<th class="ZEIT"> Zeit </th>
<th class="ARTIKEL_TITEL"> News </th>
</tr>
</thead>
<tbody>
<tr>
<td class="DATUM"> 07.09. </td>
<td class="ZEIT"> 18:06 </td>
<td class="ARTIKEL_TITEL"><a href="http://www.onvista.de/news/unternehmensberichte/artikel/07.09.2011-18:06:10-roundup-aktien-frankfurt-schluss-sehr-fest-dax-profitiert-von-bvg-entscheidung?suche=496b0ceba408ca796b867195c2b6dfe5" title="ROUNDUP/Aktien Frankfurt Schluss: Sehr fest; Dax profitiert von BVG-Entscheidung"> ROUNDUP/Aktien Frankfurt Schluss: Sehr fest; Dax profitiert von BVG-En... </a></td>
</tr>
<tr class="HERVORGEHOBEN">
<td class="DATUM"> 07.09. </td>
<td class="ZEIT"> 15:58 </td>
<td class="ARTIKEL_TITEL"><a href="http://www.onvista.de/news/unternehmensberichte/artikel/07.09.2011-15:58:08-roundup-4-saab-beantragt-glaeubigerschutz-das-aus-rueckt-immer-naeher?suche=496b0ceba408ca796b867195c2b6dfe5" > ROUNDUP 4: Saab beantragt Gläubigerschutz: Das Aus rückt immer näher </a></td>
</tr>
</tbody>
</table>
Many thanks in advance!
Greetings,
Olli
Answers
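The correct XPath string for the "Titel" attribute is //h:td[@class='ARTIKEL_TITEL']/h:a/text(). In your HTML the title text is not a direct child of the td element; it sits inside the nested a element. That is why //h:td[@class='ARTIKEL_TITEL']/text() matches no text nodes and returns an empty value, while the "Link" query, which already steps into h:a, works.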
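
To see the difference outside RapidMiner, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is only an illustration, not how the Extract Information operator works internally. ElementTree supports just a subset of XPath (there is no text() step), so the .text attribute stands in for it. The snippet is a trimmed copy of one table row with a placeholder href, and binding the h prefix to the XHTML namespace is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: the "h" prefix is bound to the XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}

# Trimmed copy of one row from the question; the href is a placeholder.
SNIPPET = f"""
<table xmlns="{XHTML_NS}">
  <tbody>
    <tr>
      <td class="DATUM"> 07.09. </td>
      <td class="ZEIT"> 15:58 </td>
      <td class="ARTIKEL_TITEL"><a href="#"> ROUNDUP 4: Saab beantragt Gläubigerschutz </a></td>
    </tr>
  </tbody>
</table>
"""


def extract_title_cell(xml_text):
    """Return the text found directly in the td, in its nested a, and the href."""
    root = ET.fromstring(xml_text)
    cell = root.find(".//h:td[@class='ARTIKEL_TITEL']", NS)
    link = cell.find("h:a", NS)
    return {
        # Text sitting directly inside <td>, i.e. what td/text() would select.
        "td_text": cell.text,
        # Text inside the nested <a>, i.e. what td/a/text() selects.
        "a_text": link.text.strip(),
        "href": link.get("href"),
    }


result = extract_title_cell(SNIPPET)
print(result)
```

Here td_text is None because the td holds no text of its own, while a_text carries the headline. The same check applies to any cell whose content is wrapped in a child element.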