Using "Cut Document" Operator neglects numbers and punctuation in HTML text
Limegreenman900
Hi everyone,
I am currently using the "Cut Document" Operator with query type "Regular Region" to extract specific text out of locally stored HTML files.
This works pretty well so far; however, it seems that all numbers in the text are being dropped.
For example, the original text:
<td style=" width:100.00%; text-align:justify; " class="ta_10"><span class="ta_10">Companies Act 2006. Our audit work has been undertaken so that we might state to the company's members those</span></td>
<td style=" width:100.00%; text-align:justify; " class="ta_10"><span class="ta_10">concerning the cost of the fixed asset investment, stated at £51,925 in note 6 to the financial statements.</span></td>
Text after extraction:
Companies Act Our audit work has been undertaken so that we might state to the company s members those
concerning the cost of the fixed asset investment stated at  in note to the financial statements
Punctuation characters such as , and . are dropped as well. Does anyone know whether there is a setting that keeps both punctuation characters and numbers?
My code right now looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="112" y="30">
<parameter key="file" value="C:\Users\Independent Auditors Report\Prod224_0010_00178176_20131231.html"/>
<parameter key="extract_text_only" value="false"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
<parameter key="query_type" value="Regular Region"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="Independent Report" value="(?i)(&gt;[^&gt;]+Independent Auditors(')? to[^&lt;]+&lt;).name=&quot;[^&quot;]+NameSeniorStatutoryAuditor&quot;"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content (2)" width="90" x="112" y="30">
<parameter key="minimum_text_block_length" value="3"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30"/>
<operator activated="true" class="text:extract_token_number" compatibility="5.3.002" expanded="true" height="60" name="Extract Token Number" width="90" x="514" y="30"/>
<connect from_port="segment" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Extract Token Number" to_port="document"/>
<connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="120">
<list key="text_directories">
<parameter key="test" value="C:\Users\ndependent Auditors Report\Teil 1"/>
</list>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
If I set the mode of the Tokenize operator to "linguistic tokens" with English as the language, it works perfectly.
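For anyone who finds this later: neither Tokenize operator in the process above sets a mode, so they run with the default, "non letters", which splits the text at every non-letter character. That is why digits and punctuation vanish from the output. Below is a minimal sketch of how the Tokenize operator inside Cut Document might look with the setting changed. The parameter keys `mode` and `language` and their values are my assumption about how the Text Processing extension serializes these settings, so compare them with the XML your own RapidMiner version exports.

```xml
<!-- Sketch only: the "mode" and "language" keys and values are assumed, not taken from the original process -->
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30">
<!-- Switch from the default "non letters" mode so numbers and punctuation survive as tokens -->
<parameter key="mode" value="linguistic tokens"/>
<parameter key="language" value="English"/>
</operator>
```

The same change would be needed on the second Tokenize operator (inside Process Documents from Files) if that branch is used as well.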