"[SOLVED] Empty Word List"
Hi All,
I am counting word occurrences in a txt file. The file contains abstracts of other documents, each preceded by the document's title. The general format of the file is:
<document name>
<abstract>
<white space>
...
This continues for roughly 36,00 documents. The total size of the file is 46MB. I expect to get a word list with the occurrence count of each word, but what I actually get is an empty word list. I used this YouTube video as a guide: https://www.youtube.com/watch?feature=endscreen&NR=1&v=EjD2M4r4mBM
Here is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="641" width="1024">
      <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="179" y="75">
        <parameter key="file" value="C:\Users\Administrator\Desktop\DTIC_RDF\sample.xml"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="447" y="75">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <process expanded="true" height="645" width="1024">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="125" y="28"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="313" y="75"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Please let me know what I am doing wrong. Thanks.
Answers
It might help to check the "create word vector" option in the Process Documents operator.
Additionally, you are reading only one document, but your pruning settings (prune_below_absolute is 2) ignore words that appear in fewer than two documents. With a single document, no word can reach that threshold, so every word is pruned. For testing, I suggest disabling pruning.
Happy mining,
Marius
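For reference, the two suggestions above could look roughly like this in the process XML. This is only a sketch of the Process Documents operator from the posted process with two parameters changed. It assumes that "none" is the value RapidMiner stores when pruning is switched off in the GUI, which is not confirmed in this thread, so it is safer to change the settings in the parameter panel and let RapidMiner write the XML itself.

```xml
<operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="447" y="75">
  <!-- changed: was "false" in the posted process -->
  <parameter key="create_word_vector" value="true"/>
  <parameter key="add_meta_information" value="false"/>
  <parameter key="keep_text" value="true"/>
  <!-- changed: was "absolute" with prune_below_absolute = 2;
       "none" is an assumed value for disabled pruning -->
  <parameter key="prune_method" value="none"/>
  <!-- inner subprocess unchanged from the posted process -->
  <process expanded="true" height="645" width="1024">
    <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="125" y="28"/>
    <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="313" y="75"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
  </process>
</operator>
```

The rest of the process (Read Document and the connections to the result port) stays as posted.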
After changing options, it is generally a good idea to press Enter or click somewhere on the process pane to make sure that the changes are actually applied. Maybe the options had not been applied yet when you hit the run button (yes, this needs improvement :-\ )
Best, Marius