"Problem with filtering the text"
Hi everyone,
I want to filter a text document and remove the stopwords. I set up the process as Read Document, then Tokenize, then Filter Stopwords, and then Write Document, but the result is the same: the stopwords were not removed. The XML is below. No problems or warnings are reported; the file in the result folder is simply identical to the text I put into Read Document.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="341" width="681">
<operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="36" y="86">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\negative.txt"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="176" y="88"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="317" y="79">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\stopwords_greek.txt"/>
</operator>
<operator activated="true" class="text:write_document" compatibility="5.1.001" expanded="true" height="60" name="Write Document" width="90" x="447" y="75">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\result\ρεσσσσσσσσσσσσσσσσσ"/>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Can anyone help me? I use the dictionary variant of the stopword filter because my text uses Greek characters.
Answers
Edit: Please note the workaround posted by colo for the rather unpleasant operator behaviour, which is probably the solution to your problem.
Original message:
Make sure to select an appropriate encoding for your Greek characters in the process and its operators.
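As a hedged illustration of the encoding advice, the fragment below sets an explicit encoding on the two operators that read files. The `encoding` parameter key and the value `UTF-8` are assumptions (neither appears anywhere in this thread); the value has to match how negative.txt and stopwords_greek.txt were actually saved, which may well be a different Greek code page.

```xml
<!-- Fragment only: these two operators would replace the matching ones in the posted process. -->
<!-- Assumption: both operators accept an "encoding" parameter; verify this in the parameter panel. -->
<operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="36" y="86">
  <parameter key="file" value="C:\Users\Αlkis_!!\Desktop\negative.txt"/>
  <!-- Assumed key and value: must match the encoding the text file was saved with. -->
  <parameter key="encoding" value="UTF-8"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="317" y="79">
  <parameter key="file" value="C:\Users\Αlkis_!!\Desktop\stopwords_greek.txt"/>
  <!-- Assumed key and value: if the dictionary is decoded differently from the document, no token will match. -->
  <parameter key="encoding" value="UTF-8"/>
</operator>
```

If the two files were saved with different encodings, each operator would need its own matching value.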
Apart from that, it's hard to tell what's wrong, as the process works fine for me (on my own data files, of course).
Did you step through the process (selecting an operator and pressing F7 creates a breakpoint after the operator) to see where the process stops doing what you want it to do?
Regards,
Marco
This is a problem with the Document data type that I ran into earlier (I mentioned it here: http://rapid-i.com/rapidforum/index.php/topic,2126.0.html).
You can modify whatever you want; in the end, the original document content is used (by "Write Document", for example, or by operators like "Extract Information"). Intended behavior or not, this made the data type mostly unusable for me, so I always convert to example sets and work on the columns instead of on documents. But documents still offer more possibilities and operators for text mining tasks (such as handling multiple matches from regular expressions or XPath with "Cut Document", stopword filters, etc.).
The only way I found to use the modified document content is to encapsulate the operators in one of the "Process Documents" operators with the option "keep text" enabled. This results in an example set, which then has to be transformed again in order to write a document to a file (Extract Macro, Create Document, Write Document, for example).
BUT the "funny" thing is this: if you place your operator chain inside a "Process Documents" operator, the modified output is suddenly used by "Write Document". It should work if you modify your example this way, with Tokenize, Filter Stopwords (Dictionary) and Write Document moved inside "Process Documents".
I would prefer the document type to be more flexible to use. I expected exactly the same behavior as you did, got confused, and still don't know why it works this way. Why should I modify documents if the output is always only the original content? Why is the modified content used only inside "Process Documents" and not everywhere?
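As a rough sketch of the workaround described above, the original poster's chain could be nested inside "Process Documents" along the following lines. The operator class `text:process_documents`, the `keep_text` parameter key, the port names (`documents 1`, `example set`, `document 1`, `source_document`) and the layout coordinates are assumptions based on typical RapidMiner 5 process files, not taken from this thread; the file paths are the ones from the question.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="341" width="681">
      <operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="36" y="86">
        <parameter key="file" value="C:\Users\Αlkis_!!\Desktop\negative.txt"/>
      </operator>
      <!-- Assumed class name and parameter key for the wrapper operator. -->
      <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="176" y="86">
        <parameter key="keep_text" value="true"/>
        <process expanded="true" height="341" width="681">
          <!-- The original chain, unchanged, now runs inside the subprocess. -->
          <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="180" y="30">
            <parameter key="file" value="C:\Users\Αlkis_!!\Desktop\stopwords_greek.txt"/>
          </operator>
          <operator activated="true" class="text:write_document" compatibility="5.1.001" expanded="true" height="60" name="Write Document" width="90" x="315" y="30">
            <parameter key="file" value="C:\Users\Αlkis_!!\Desktop\result\ρεσσσσσσσσσσσσσσσσσ"/>
          </operator>
          <!-- Assumed inner port names: "document" as source, "document 1" as sink. -->
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Write Document" to_port="document"/>
          <connect from_op="Write Document" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <!-- Assumed outer port names on Process Documents. -->
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
```

It is probably easier to rebuild this in the GUI (drag the three operators into "Process Documents" and reconnect them) than to paste the XML, since the GUI fills in the correct port names itself.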
Best regards
Matthias
Oh, I stumbled upon this problem a while ago in a different context, where I had to use Documents to create a new web plugin operator, but I did not know that it also affects other operators that use Documents.
I will bring this up as soon as possible.
Regards,
Marco