How to process multiple MS Word files in RapidMiner?
Dear all,
I want to process multiple MS Word files.
If I use 'Process Documents from Files' as per the tutorial, the file content looks corrupted. For example, for a file named helloworld.docx that contains only two words, "hello world", RapidMiner produces a chunk of unrelated words as output.
I understand I can use 'Read Office File' to read the exact content of an MS Word document; however, this operator handles only one file at a time.
How do I combine these two operators, or are there additional tools I could use? Neither 'Read Office File -> Process Documents from Files -> res' nor 'Process Documents from Files -> Read Office File -> res' seems logical.
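For readers who want to check the idea outside RapidMiner: a .docx file is a ZIP archive whose main text lives in word/document.xml, which is a likely reason a plain-text reader returns unrelated characters. The sketch below, in Python with only the standard library, reads every .docx in a folder one file at a time, the same loop-then-read pattern asked about here. The function names read_docx_text and read_folder are made up for illustration, and the older binary .doc format is not handled.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return the plain text of one .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph keeps its text in one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


def read_folder(folder):
    """Map file name -> text for every .docx file directly inside folder."""
    return {p.name: read_docx_text(p) for p in sorted(Path(folder).glob("*.docx"))}
```

Calling read_folder on a directory such as the t1 folder would give one text per file, ready for further analysis. This is only a simplified reader: headers, footnotes, and text boxes are not treated specially.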
My goal is to load a batch of MS Word files for readability analysis, for example using indexes such as SMOG and FOG to check the readability of a large amount of content, so I can gather more data samples for a university research paper.
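To make the two indexes concrete, here is a minimal Python sketch of the standard Gunning Fog and SMOG formulas. It is an approximation under stated assumptions: syllables are estimated by counting vowel groups, every word with three or more estimated syllables counts as complex, and sentences are split on ., ! and ?. The helper names count_syllables and readability are hypothetical, and a dedicated readability tool will apply more careful rules.

```python
import math
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return (gunning_fog, smog) for an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # Words with three or more estimated syllables count as complex.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    # Gunning Fog: 0.4 * (average sentence length + percentage of complex words)
    fog = 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))
    # SMOG: polysyllable count normalised to a 30-sentence sample
    smog = 1.0430 * math.sqrt(complex_words * 30 / len(sentences)) + 3.1291
    return fog, smog
```

SMOG was designed for samples of about 30 sentences, so scores for very short documents should be read with caution.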
Thanks a lot!
Answers
Dortmund, Germany
How do I set up the parameters of the 'Loop Files' operator to load multiple MS Word files into RapidMiner?
My setup is 'Loop Files' -> 'Read Office File' -> rest.
Loop Files:
Directory: C:/Users/user/Downloads/t1
Filter type: glob
Filter by glob: .*doc
Enable parallel execution
If 'filter by glob' is .*doc, I get the error: "Not enough iterations: the minimum number of iterations must not be smaller than 1."
If 'filter by glob' is *.doc, I get the error: "Input is missing: the previous operator Loop Files did not produce any output."
There are 3 files in the t1 folder: two .doc files and one .docx file.
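The two patterns tried above belong to different syntaxes: .*doc is written like a regular expression, while *.doc is written like a glob. The Python sketch below illustrates the difference using the standard fnmatch and re modules on three hypothetical file names that mirror the folder contents. RapidMiner does its own matching, so this only shows how the two syntaxes generally behave; the use of full-string matching for the regex case is an assumption.

```python
import fnmatch
import re

# Hypothetical file names mirroring the folder described above
files = ["a.doc", "b.doc", "c.docx"]


def glob_matches(pattern, names):
    """Names matching a glob pattern (* means any run of characters)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def regex_matches(pattern, names):
    """Names that a regular expression matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]


print(glob_matches(".*doc", files))         # [] : as a glob, the name must start with "."
print(glob_matches("*.doc", files))         # ['a.doc', 'b.doc']
print(glob_matches("*.doc*", files))        # ['a.doc', 'b.doc', 'c.docx']
print(regex_matches(".*doc", files))        # ['a.doc', 'b.doc']
print(regex_matches(r".*\.docx?", files))   # ['a.doc', 'b.doc', 'c.docx']
```

Read as a glob, .*doc matches none of the three files, which would be consistent with the "Not enough iterations" message.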
I also looked up on Google how to use Loop Files; however, the parameter settings shown in the 2018 YouTube videos no longer seem valid for the current version.
Looking forward to your replies.
With thanks!
Kevin
I tried what we discussed. What is still missing?
Please see the attached screenshot, thanks.
The Read Office File parameters are at their defaults, with 'detect file type' enabled. Thanks.
(There are only 2 .doc files in the t1 folder.)
<process version="9.8.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="9.8.000" expanded="true" height="82" name="Loop Files" width="90" x="514" y="34">
        <parameter key="filter_type" value="regex"/>
        <parameter key="filter_by_regex" value=".*docx"/>
        <parameter key="recursive" value="false"/>
        <parameter key="enable_macros" value="false"/>
        <parameter key="macro_for_file_name" value="file_name"/>
        <parameter key="macro_for_file_type" value="file_type"/>
        <parameter key="macro_for_folder_name" value="folder_name"/>
        <parameter key="reuse_results" value="false"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="operator_toolbox:read_word_files" compatibility="2.8.000-SNAPSHOT" expanded="true" height="68" name="Read Office File" width="90" x="246" y="34">
            <parameter key="detect_file_type" value="true"/>
            <parameter key="file_extension" value="docx"/>
          </operator>
          <connect from_port="file object" to_op="Read Office File" to_port="file"/>
          <connect from_op="Read Office File" from_port="doc" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Add directory here</description>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="9.3.001" expanded="true" height="82" name="Documents to Data" width="90" x="715" y="34">
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>