(Solved) Removing tags from extracted data
I'm very new and just starting out with a scraping process. It doesn't really have a purpose; I'm just playing around and trying to learn. My process was originally based on Neil McGuigan's tutorials on Vancouver Data Blog, but it has grown a bit as I try new things.
Currently I'm crawling with the Process Documents from Web operator and using Extract Information as a subprocess. I'm querying 9 attributes with XPath queries. Finally, I use Write Excel to output the data into a spreadsheet. All of that works fine.
The problem is that the extracted information contains HTML tags, specifically H1 and TD tags, and I can't find a way to remove them. I've tried the Extract Content, Remove Document Parts, and Replace operators. So far nothing has worked.
This is what a typical result looks like:
<td xmlns="http://www.w3.org/1999/xhtml" colspan="1" rowspan="1">33</td>
But all I need is the 33.
I checked the FAQ, the tutorials, and searched the forums, but I haven't found anything. Any suggestions?
Here's the XML behind my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
    <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
    <process expanded="true" height="620" width="435">
      <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
        </list>
        <parameter key="max_pages" value="6"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="5000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
        <parameter key="parallelize_process_webpage" value="true"/>
        <process expanded="true" height="620" width="433">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1"/>
              <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]"/>
              <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]"/>
              <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]"/>
              <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]"/>
              <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]"/>
              <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]"/>
              <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]"/>
              <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
        <parameter key="excel_file" value="C:\Users\Public\Documents\Rapidminer Repository\Results\Results.xls"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="18"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Answers
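You can use the XPath text() function: append /text() to each query so that it returns the element's text content rather than the element itself.
Best,
Nils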
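The snippet that originally accompanied this answer did not survive, so the following is only a sketch of what the suggestion probably amounts to, assuming the same query names and paths as in the question's process. Each query in the xpath_queries list of the Extract Information operator gets /text() appended, so it selects the text node (for example 33) instead of the enclosing td or h1 element:

```xml
<!-- Sketch only: the question's xpath_queries list with /text() appended to each query. -->
<!-- Assumes each target h1/td holds its value as a direct text child. -->
<list key="xpath_queries">
<parameter key="Fighter" value="//h:div[@class='Resume']/h:h1/text()"/>
<parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]/text()"/>
<parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]/text()"/>
<parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]/text()"/>
<parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]/text()"/>
<parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]/text()"/>
<parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]/text()"/>
<parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]/text()"/>
<parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]/text()"/>
</list>
```

Note that text() only matches text nodes that are direct children of the selected element. If a cell wraps its value in another element, such as a link, the query for that attribute would likely need an extra step (for example //text() under the td) to reach it.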