Problem Mandarin Text mining - HanMiner
Hi everyone,
I am a newbie here but here is my situation.
I need to conduct a qualitative content analysis of a large number of Chinese reports. However, RapidMiner needs an extension to handle Chinese characters. I found one called HanMiner, posted by another member.
I followed the instructions and installed the extension via GitHub, but the extension does not show up in RapidMiner ...
Any ideas on how to solve this issue? Or is there another way to text mine Chinese documents?
Any help would be much appreciated!
Yoyo
Best Answer
jwpfau Employee-RapidMiner, Member Posts: 303 RM Engineering
Hi,
the third-party HanMiner extension has no option to define the encoding of the imported file. As a workaround, you could use macros:
<?xml version="1.0" encoding="UTF-8"?>
<process version="10.1.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="10.1.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="10.1.002" expanded="true" height="68" name="Open File" width="90" x="112" y="34">
        <parameter key="resource_type" value="URL"/>
        <parameter key="filename" value=""/>
        <parameter key="url" value="https://us.v-cdn.net/6030995/uploads/editor/sf/nq6mm23abhpa.txt"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="10.1.002" expanded="true" height="103" name="Multiply" width="90" x="246" y="85"/>
      <operator activated="true" class="text:read_document" compatibility="10.0.000" expanded="true" height="68" name="Read Document (2)" width="90" x="380" y="34">
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="10.0.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="514" y="34">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="10.1.002" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="34">
        <parameter key="macro" value="text"/>
        <parameter key="macro_type" value="data_value"/>
        <parameter key="statistics" value="average"/>
        <parameter key="attribute_name" value="text"/>
        <parameter key="example_index" value="1"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="hanminer:read_document" compatibility="1.0.003" expanded="true" height="68" name="Read Document" width="90" x="782" y="136">
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="import_from_file" value="false"/>
        <parameter key="text" value="%{text}"/>
        <parameter key="file" value="C:/Users/Rui/Downloads/archive (6)-chinese/chinese-dataset-subset.txt"/>
      </operator>
      <operator activated="true" class="hanminer:tokenize" compatibility="1.0.003" expanded="true" height="68" name="Tokenize" width="90" x="916" y="136">
        <parameter key="high_speed_mode" value="false"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Read Document (2)" to_port="file"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Read Document" to_port="file"/>
      <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/>
      <connect from_op="Documents to Data (2)" from_port="example set" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document set"/>
      <connect from_op="Tokenize" from_port="document set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Greetings,
Jonas
Answers
Scott
Yes it is.
I'm trying to install the following, but have had no success so far.
https://github.com/joeyhaohao/rapidminer-Hanminer
Nothing happens at step 4 when I try to install the extension.
I am also trying to look at other options but it is harder than I expected...
Any help would be great, cheers!
Yoyo
I'm going to cc my good friend and colleague @yyhuang who will know a LOT more about this than I do.
Scott
Dortmund, Germany
Hi @YoGVA, after installing manually by placing the unzipped .jar file in my local extensions folder C:\Users\Yy\.RapidMiner\extensions and restarting, everything is working fine. You can follow the instructions here: https://community.rapidminer.com/discussion/31996/install-extensions-manually-for-rapidminer-studio
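The manual install described above (copy the .jar into the extensions folder, then restart) could be scripted roughly as follows. This is only a sketch: the function name `install_extension` is made up for illustration, and the default folder `~/.RapidMiner/extensions` is an assumption based on the path quoted in this post, so adjust it if your installation differs.

```python
import shutil
from pathlib import Path

# Assumption: the per-user extensions folder is ~/.RapidMiner/extensions,
# mirroring the C:\Users\<name>\.RapidMiner\extensions path quoted above.
DEFAULT_EXTENSIONS_DIR = Path.home() / ".RapidMiner" / "extensions"


def install_extension(jar_path, extensions_dir=DEFAULT_EXTENSIONS_DIR):
    """Copy an extension .jar into the extensions folder and return its new path.

    Hypothetical helper, not part of RapidMiner. Studio still has to be
    restarted afterwards so that it picks up the new file.
    """
    jar = Path(jar_path)
    if jar.suffix.lower() != ".jar":
        raise ValueError(f"expected a .jar file, got: {jar.name}")
    if not jar.is_file():
        raise FileNotFoundError(jar)

    target_dir = Path(extensions_dir)
    # Create the folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(jar, target_dir / jar.name))
```

Calling `install_extension("path/to/the-extension.jar")` (placeholder path) and then restarting RapidMiner Studio corresponds to the steps in the post. If the download is a .zip archive, it would need to be unzipped first so that the .jar itself ends up in the folder.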
Six new operators are added under the new extension folder "Text Miner".
A quick test on the news data looks reasonable.
<pre class="CodeBlock"><code>
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.5.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value="yhuang@rapidminer.com"/>
    <parameter key="process_duration_for_mail" value="1"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text_miner:read_text" compatibility="1.0.000" expanded="true" height="68" name="Read Text" width="90" x="112" y="34">
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="text" value=" 这是默认的文本 每年到了这个时候,市场经济学家都会发布对未来12个月的详细宏观预测。令我自己都讶异的是,我正在为进行这项困难尝试的第五个十年画上句号,到目前为止离完美的成功预测还差得很远。经济以及市场的重大动荡可不会整整齐齐地把自己挤进一个自然年。"/>
        <parameter key="import_from_file" value="false"/>
      </operator>
      <operator activated="true" class="text_miner:tokenization" compatibility="1.0.000" expanded="true" height="68" name="Tokenization" width="90" x="313" y="34"/>
      <operator activated="true" class="text_miner:filter_stopwords" compatibility="1.0.000" expanded="true" height="68" name="Filtering" width="90" x="514" y="34"/>
      <operator activated="true" class="text_miner:word_count" compatibility="1.0.000" expanded="true" height="68" name="Word Count" width="90" x="782" y="34"/>
      <connect from_op="Read Text" from_port="output" to_op="Tokenization" to_port="text"/>
      <connect from_op="Tokenization" from_port="text" to_op="Filtering" to_port="text"/>
      <connect from_op="Filtering" from_port="text" to_op="Word Count" to_port="text"/>
      <connect from_op="Word Count" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
</code></pre>
My apologies if I should open a new question.
My question is about the latest version of HanMiner, v1.0.3. I noticed that the READ TEXT operator is now named READ DOCUMENT.
My problem is that when I import from a file using this operator, the Chinese characters turn into unrecognizable symbols.
I have tried several approaches:
1. I tried the different encodings listed and installed Chinese character support on my Windows PC, but it made no difference.
2. I imported the dataset as an example set and used the DATA TO DOCUMENTS operator as below. However, I received an error.
3. I tried connecting the DATA TO DOCUMENTS operator to the READ DOCUMENT operator, but this resulted in a wrong input/output connection.
Perhaps, @yyhuang can help shed some light here. Really appreciate it.
Thank you kindly.
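One thing that may be worth checking for the garbled-characters symptom described above is whether the source file is really UTF-8, since Chinese text files are often saved in a legacy encoding. The sketch below re-saves a file as UTF-8 outside RapidMiner. The function name `convert_to_utf8` and the candidate list are assumptions made for illustration, and the detection is only a heuristic: GB18030 accepts many byte sequences, so Big5 text can be misread as GB18030 without raising an error. The result should be checked by eye.

```python
from pathlib import Path

# Assumption: encodings to try, in order. "utf-8-sig" reads UTF-8 with or
# without a byte-order mark and is strict, so it goes first.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030", "big5")


def convert_to_utf8(src, dst, candidates=CANDIDATE_ENCODINGS):
    """Re-save a text file as UTF-8 and return the encoding that decoded it.

    Hypothetical helper: it takes the first candidate that decodes without
    error, which is a heuristic and not a reliable detector.
    """
    raw = Path(src).read_bytes()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Write bytes directly so line endings are left exactly as they were.
        Path(dst).write_bytes(text.encode("utf-8"))
        return encoding
    raise ValueError(f"none of {candidates} could decode {src}")
```

If the function reports something other than `utf-8-sig`, the original file was probably not UTF-8, and importing the converted copy with the encoding set to UTF-8 may behave differently.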
Have you tried changing the encoding to UTF-8?
Greetings,
Jonas
Yes, I have, but still nothing.