Removing HTTP Headers
I'm trying to run text analytics on a set of pre-downloaded HTML files, but unfortunately they also include the HTTP headers (e.g. Content-type: text/html). I've tried using Remove Document Parts with regular expressions to strip out the headers before passing the document to Extract Content, but for some reason the Extract Content operator ignores the removals. To test this, I set up a simple process that takes as input a text file containing the words "one two three". Remove Document Parts removes the word "one" (checked via a breakpoint), but the final output still includes it. Can anyone help me understand why Extract Content is ignoring the prior removal, or suggest workarounds or alternative methods of removing HTTP headers from files?
Thanks.
As a workaround I used Replace Tokens after the Extract Content operator, though this is less than ideal for pattern matching.
Thanks.
Updated:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="460" width="899">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="test" value="C:\Users\XXX\test_files"/>
        </list>
        <process expanded="true" height="460" width="899">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="45" y="30">
            <parameter key="deletion_regex" value="one"/>
          </operator>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
            <parameter key="minimum_text_block_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="RM One" to_port="document"/>
          <connect from_op="RM One" from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Answers
If you place a 'Combine Documents' operator after 'Remove Document Parts', it works; at least it did for me.
Best,
Nils
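For anyone trying the suggestion above, here is a rough sketch of how the inner process of "Process Documents from Files" might look with the extra operator wired in. It is not taken from a tested export: the class key `text:combine_documents`, the input port name `documents 1`, the compatibility value, and the x/y layout coordinates are assumptions and should be checked against a process exported from your own installation. The other operators and parameters are copied from the process posted in the question.

```xml
<!-- Inner process of "Process Documents from Files" with Combine Documents added. -->
<process expanded="true" height="460" width="899">
  <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="45" y="30">
    <parameter key="deletion_regex" value="one"/>
  </operator>
  <!-- Assumed class key, port names, and compatibility value; verify before use. -->
  <operator activated="true" class="text:combine_documents" compatibility="5.2.001" expanded="true" height="76" name="Combine Documents" width="90" x="179" y="30"/>
  <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="313" y="30">
    <parameter key="minimum_text_block_length" value="3"/>
  </operator>
  <!-- The document now passes through Combine Documents before Extract Content. -->
  <connect from_port="document" to_op="RM One" to_port="document"/>
  <connect from_op="RM One" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
  <connect from_op="Combine Documents" from_port="document" to_op="Extract Content" to_port="document"/>
  <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
  <portSpacing port="source_document" spacing="0"/>
  <portSpacing port="sink_document 1" spacing="0"/>
  <portSpacing port="sink_document 2" spacing="0"/>
</process>
```

To use it for real HTTP headers, replace the test value of `deletion_regex` ("one") with a pattern that matches the header block in your files.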