XPath with "Cut Document" or "Extract Information" with "?"-result
Dear RM-experts,
I'm struggling to extract certain information from the websites I have crawled.
My process is as follows:
I have a "Crawl web" operator crawling websites in a loop. This process works fine (tested with up to 17 iterations).
The crawled web pages are stored as html-files (one file for each site).
Now I want to extract a specific piece of information from these websites, for which I have an XPath statement that works fine in Google Sheets but not in RM. I tried the process both with the recommended "Cut Document" operator and with the "Extract Information" operator inside a "Process Documents from Files" process.
I have already searched the forum and tried every variation of "//h:" and "assume html" (knowing that the XPath syntax in RM is slightly different), but with no success.
Is anybody out there with a solution for this issue?
Here is my current process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<parameter key="logverbosity" value="all"/>
<process expanded="true">
<operator activated="false" class="concurrency:loop" compatibility="7.5.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
<parameter key="number_of_iterations" value="2"/>
<parameter key="reuse_results" value="true"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="34">
<parameter key="url" value="https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&amp;page=%{iteration}#ms-jobs-result-list"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+standard.+"/>
<parameter key="follow_link_with_matching_url" value=".+standard.*"/>
</list>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
<parameter key="output_file_extension" value="%{iteration}.html"/>
<parameter key="max_pages" value="20"/>
<parameter key="max_page_size" value="100"/>
<parameter key="delay" value="1000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"/>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="246" y="136">
<list key="text_directories">
<parameter key="all" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="use_file_extension_as_type" value="false"/>
<parameter key="content_type" value="html"/>
<parameter key="encoding" value="UTF-8"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Branche" value="//*[@id=&quot;ms-maincontent&quot;]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"/>
</list>
<list key="namespaces"/>
<parameter key="assume_html" value="false"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Thanks for your support.
Answers
I've done a quick test with following page :
https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&page=1#ms-jobs-result-list
This is basically the first page you would grab with your logic. On this page there is no h4 that contains the text 'Arbeitgeber', which is why you get no results.
Apart from that, you need to add the h: prefix to every element, since all of them are in the same HTML namespace. In my test the XPath matched up to the 4th div; from there on it does not match anything anymore. This may be down to the page I used, so it could still work for you.
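The namespace point above can be sketched with Python's lxml; the HTML fragment below is a hypothetical stand-in for the crawled page, not the actual markup of jobs.meinestadt.de:

```python
from lxml import etree

# Minimal, hypothetical XHTML stand-in for the crawled page;
# the real page structure may differ.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="ms-maincontent">
      <section>
        <h4>Arbeitgeber</h4>
        <p>Firma GmbH</p>
        <p>Befristete Überlassung von Arbeitskräften</p>
      </section>
    </div>
  </body>
</html>"""

root = etree.fromstring(xhtml)
ns = {"h": "http://www.w3.org/1999/xhtml"}

# Without the prefix nothing matches: an unprefixed name means
# "no namespace", but every element here is in the XHTML namespace.
print(root.xpath("//h4", namespaces=ns))   # []

# With h: on the element name the h4 is found. Note that @id and
# text() need no prefix; only element names do.
hits = root.xpath(
    "//*[@id='ms-maincontent']//h:h4[contains(text(),'Arbeitgeber')]",
    namespaces=ns,
)
print(len(hits))   # 1
```

This is why adding h: to every element step matters: a single unprefixed div in the chain is enough to make the whole query return nothing.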
Hope this helps.
Dear Kayman,
thanks for this immediate response.
You're right for the results page you tested, but I am on a specific job detail page like this:
https://jobs.meinestadt.de/deutschland/standard?id=200880935
to judge whether a job is posted directly by a company or by a personnel-leasing agency.
On that detail page the XPath //*[@id="ms-maincontent"]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2] returns the value "Befristete Überlassung von Arbeitskräften", so this is a personnel-leasing job posting.
I now tried with
//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]
//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h4[contains(text(),'Arbeitgeber')]/h:following-sibling::h:p[2]
//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h:h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]
//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h:h4[contains(text(),'Arbeitgeber')]/h:following-sibling::h:p[2]
but with no success. Where did I go wrong?
You used the "HTML to XML" operator. How can I apply this operator to my stored HTML files?
Thanks
Aaah, found it. Try this :
It's a bit hard to explain, but what you did is select the h4 and then travel on to the second p within that node; your h4 has no such p nodes to travel to, so selecting a sibling gets you nothing. Instead you have to select the element that contains the h4 (in this case the section) and take the second p in that section.
Another way would be to go one step up once you have selected the h4 and then take the second p element.
The double dot (..) takes you back to the parent level, but this may be less reliable than the first variation.
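The two corrected selections can be sketched with lxml. The fragment below is a hypothetical layout in which the p elements precede the h4 in document order (one way the sibling axis can come up empty); it is not the actual page markup:

```python
from lxml import etree

# Hypothetical layout where following-sibling::p finds nothing,
# because the p elements come before the h4 in document order.
xhtml = """<section xmlns="http://www.w3.org/1999/xhtml">
  <p>Firma GmbH</p>
  <p>Befristete Überlassung von Arbeitskräften</p>
  <h4>Arbeitgeber</h4>
</section>"""

root = etree.fromstring(xhtml)
ns = {"h": "http://www.w3.org/1999/xhtml"}

# The sibling axis from the h4 itself comes up empty here.
print(root.xpath("//h:h4/following-sibling::h:p[2]", namespaces=ns))  # []

# Variant 1: select the section that contains the h4, then its 2nd p.
v1 = root.xpath(
    "//h:section[h:h4[contains(text(),'Arbeitgeber')]]/h:p[2]",
    namespaces=ns,
)

# Variant 2: from the h4, '..' steps up to the parent, then take p[2].
v2 = root.xpath(
    "//h:h4[contains(text(),'Arbeitgeber')]/../h:p[2]",
    namespaces=ns,
)

print(v1[0].text)   # Befristete Überlassung von Arbeitskräften
print(v2[0].text)   # Befristete Überlassung von Arbeitskräften
```

Variant 1 is the more robust of the two, since it anchors on the containing section rather than on the h4's position within it.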
Don't mind the "HTML to XML" operator, by the way. I typically use it because I load the XML into another editor, and this way I am always sure the HTML is proper XHTML.
Dear Kayman,
many thanks for that, it now works fine with your first suggestion.
Just one tiny detail: the result is now:
"<p xmlns='http://www.w3.org/1999/xhtml'>Befristete Überlassung von Arbeitskräften</p>"
Any idea how I can get only to "Befristete Überlassung von Arbeitskräften"?
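The thread does not show the final answer to this, but one common approach (an assumption on my part, not confirmed by the posters) is to end the query in text(), so the text node is returned instead of the serialized p element with its xmlns declaration:

```python
from lxml import etree

# Hypothetical fragment, not the actual page markup.
xhtml = """<section xmlns="http://www.w3.org/1999/xhtml">
  <h4>Arbeitgeber</h4>
  <p>Firma GmbH</p>
  <p>Befristete Überlassung von Arbeitskräften</p>
</section>"""
root = etree.fromstring(xhtml)
ns = {"h": "http://www.w3.org/1999/xhtml"}

# Selecting the element serializes it with its xmlns declaration ...
elem = root.xpath("//h:p[2]", namespaces=ns)[0]
print(etree.tostring(elem))

# ... while appending /text() yields only the character data.
print(root.xpath("//h:p[2]/text()", namespaces=ns)[0])
```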
Great support!!
I'm now working through the other elements using your sample.
Thank you very much.
Hello,
Please, may I know how you obtained your process code?