strange behavior of replace tokens operator
simon_knoll
Member Posts: 40 Contributor II
hello all,
I have a workflow containing a Create Document operator and a Process Documents operator.
The Process Documents operator contains a Tokenize operator and a Replace Tokens operator.
The Replace Tokens operator has the following rules:
replace est with Eastern_Time
replace dup with duplicate
replace hello with hallo
The Process Documents vector creation is set to Term Occurrences.
The Create Document text is:
est
dup
hello
The created word vector now contains
Eastern_Time
duplicate
hallo
And now comes the strange thing:
Eastern_Time and duplicate have occurrence 0, and hallo has occurrence 1.
I expected a vector in which every term has occurrence 1.
If I exchange the Process Documents operator for the Process Documents from Files operator and write the words
est
dup
hello
in a text file, I get the expected behavior, with a vector containing
Eastern_Time
duplicate
hallo
and every term has an occurrence of 1.
Is this a bug?
Am I doing something wrong?
all the best
simon
PS: here is the workflow with read document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="811" width="435">
      <operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document (8)" width="90" x="45" y="30">
        <parameter key="text" value="est dup hello"/>
        <parameter key="label_value" value="jmol"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents (3)" width="90" x="315" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="datamanagement" value="double_array"/>
        <process expanded="true" height="811" width="1068">
          <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:replace_tokens" compatibility="5.0.6" expanded="true" height="60" name="Replace Tokens" width="90" x="514" y="30">
            <list key="replace_dictionary">
              <parameter key="est" value="Eastern_Time"/>
              <parameter key="dup" value="duplicate"/>
              <parameter key="hello" value="hallo"/>
            </list>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document (8)" from_port="output" to_op="Process Documents (3)" to_port="documents 1"/>
      <connect from_op="Process Documents (3)" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
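
For comparison, the result expected from this process can be sketched outside RapidMiner. The Python below is only a minimal sketch, not the extension's implementation: the rules are copied from the replace_dictionary in the process, while `tokenize`, `replace_tokens` and `term_occurrences` are hypothetical helper names, and both the split on non-letters and the exact-match replacement are assumptions about how the operators behave.

```python
import re
from collections import Counter

# Rules copied from the replace_dictionary list in the process XML.
REPLACE_DICTIONARY = {"est": "Eastern_Time", "dup": "duplicate", "hello": "hallo"}


def tokenize(text):
    # Assumption: a token is a maximal run of letters; anything else separates tokens.
    return re.findall(r"[A-Za-z]+", text)


def replace_tokens(tokens, rules):
    # Assumption: a token is swapped only when it equals a rule key exactly.
    return [rules.get(token, token) for token in tokens]


def term_occurrences(text, rules):
    # One pass only: tokenize, replace, then count each resulting term.
    return Counter(replace_tokens(tokenize(text), rules))


counts = term_occurrences("est dup hello", REPLACE_DICTIONARY)
print(dict(counts))  # {'Eastern_Time': 1, 'duplicate': 1, 'hallo': 1}
```

With a single pass, each of the three terms is counted once, which is the vector described for Process Documents from Files.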
Answers
Thanks for this detailed report. I have found the problem: the documents delivered to the input ports were passed directly to the inner process. Since each document then passes through the inner process twice, they were tokenized and replaced two times. Set a breakpoint before the Tokenize operator to see this effect.
I have corrected this; the fix will be delivered with the next regular update.
Greetings,
Sebastian
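
One way a double pass could lead to the reported counts can be sketched in Python. This is purely illustrative and rests on assumptions the thread does not confirm: that the tokenizer splits on non-letters, that a rule key is replaced wherever it occurs inside a token, and that the word list comes from the first pass while the counts come from the second. `inner_process` and `replace_in_tokens` are hypothetical names.

```python
import re
from collections import Counter

# Rules taken from the process in the question.
RULES = {"est": "Eastern_Time", "dup": "duplicate", "hello": "hallo"}


def tokenize(text):
    # Assumption: tokens are runs of letters, so "_" splits a token.
    return re.findall(r"[A-Za-z]+", text)


def replace_in_tokens(tokens, rules):
    # Assumption: each rule key is replaced wherever it appears inside a token.
    result = []
    for token in tokens:
        for key, value in rules.items():
            token = token.replace(key, value)
        result.append(token)
    return result


def inner_process(text):
    # Stand-in for the Tokenize -> Replace Tokens chain inside Process Documents.
    return replace_in_tokens(tokenize(text), RULES)


first_pass = inner_process("est dup hello")
# Second pass over the already processed document (the double processing described above).
second_pass = inner_process(" ".join(first_pass))

# Assumption: the word list comes from the first pass, the counts from the second.
second_counts = Counter(second_pass)
vector = {term: second_counts[term] for term in first_pass}
print(second_pass)  # ['Eastern', 'Time', 'duplicatelicate', 'hallo']
print(vector)       # {'Eastern_Time': 0, 'duplicate': 0, 'hallo': 1}
```

Under these assumptions only hallo survives the second pass unchanged, which matches the occurrence of 1 for hallo and 0 for the other two terms.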
I was searching today for a workaround and tried this within the DocumentTextInputOperator: at first sight it worked, but I think that by doing it like that I'm messing things up. Do you have any advice for a hotfix? I need this feature really urgently.
all the best,
simon
Try using this: Think about becoming an enterprise customer; then you would already have a new release.
Greetings,
Sebastian
Thank you!!!
I'll give it a try.
all the best, simon