Problem Mandarin Text mining - HanMiner

YoGVA · December 2019

Hi everyone,

I am a newbie here, and this is my situation.
I need to conduct a qualitative content analysis of a large number of Chinese reports. However, RapidMiner needs an extension to handle Chinese characters. I found one called HanMiner, posted by another member.

I followed the instructions and installed the extension via GitHub, but the extension does not show up in RapidMiner ...

Any ideas on how to solve this issue? Or is there another way to text mine Chinese documents?

Any help would be much appreciated!
Yoyo

jwpfau · May 2023

Hi,

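The third-party HanMiner extension has no option to define the encoding of the imported file. As a workaround, you could use macros: read the file with the regular Read Document operator (which has an encoding parameter), extract the text into a macro, and pass it to HanMiner's Read Document as %{text} with import_from_file set to false: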

<?xml version="1.0" encoding="UTF-8"?><process version="10.1.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="10.1.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="10.1.002" expanded="true" height="68" name="Open File" width="90" x="112" y="34">
        <parameter key="resource_type" value="URL"/>
        <parameter key="filename" value=""/>
        <parameter key="url" value="https://us.v-cdn.net/6030995/uploads/editor/sf/nq6mm23abhpa.txt"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="10.1.002" expanded="true" height="103" name="Multiply" width="90" x="246" y="85"/>
      <operator activated="true" class="text:read_document" compatibility="10.0.000" expanded="true" height="68" name="Read Document (2)" width="90" x="380" y="34">
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="10.0.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="514" y="34">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="10.1.002" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="34">
        <parameter key="macro" value="text"/>
        <parameter key="macro_type" value="data_value"/>
        <parameter key="statistics" value="average"/>
        <parameter key="attribute_name" value="text"/>
        <parameter key="example_index" value="1"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="hanminer:read_document" compatibility="1.0.003" expanded="true" height="68" name="Read Document" width="90" x="782" y="136">
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="import_from_file" value="false"/>
        <parameter key="text" value="%{text}"/>
        <parameter key="file" value="C:/Users/Rui/Downloads/archive (6)-chinese/chinese-dataset-subset.txt"/>
      </operator>
      <operator activated="true" class="hanminer:tokenize" compatibility="1.0.003" expanded="true" height="68" name="Tokenize" width="90" x="916" y="136">
        <parameter key="high_speed_mode" value="false"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Read Document (2)" to_port="file"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Read Document" to_port="file"/>
      <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/>
      <connect from_op="Documents to Data (2)" from_port="example set" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document set"/>
      <connect from_op="Tokenize" from_port="document set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Greetings,
Jonas

sgenzer · January 2020

hi @YoGVA I'm sorry no one has chimed in here. Is this still an issue?

Scott

YoGVA · January 2020

Hi Scott,

Yes it is.

I'm trying to install the following, but without success so far:
https://github.com/joeyhaohao/rapidminer-Hanminer
Nothing happens at step 4 when I try to install the extension.

I am also trying to look at other options but it is harder than I expected...

Any help would be great, cheers!
Yoyo

sgenzer · January 2020

hi @YoGVA hmm never seen that repo before!

I'm going to cc my good friend and colleague @yyhuang who will know a LOT more about this than I do.

Scott

MartinLiebig · January 2020

Hi @YoGVA ,

Here is a compiled version of the GitHub repository, which you can just unzip and copy to .RapidMiner/extensions. This works, but I have not tested the operators themselves.

Best,

Martin

yyhuang · January 2020

Thanks for sharing the compiled extension, Dr. @mschmitz!
After installing it manually by placing the unzipped .jar file in my local extensions folder, C:\Users\Yy\.RapidMiner\extensions, and restarting, everything works fine. Hi @YoGVA, you can follow the instructions here: https://community.rapidminer.com/discussion/31996/install-extensions-manually-for-rapidminer-studio
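
For anyone who prefers to script the manual install, here is a minimal Python sketch of the copy step described above. It is not part of RapidMiner or HanMiner. The function name and the placeholder .jar file name are hypothetical, and it assumes that the extensions folder sits directly under the .RapidMiner directory, as in the path mentioned in this post.

```python
import shutil
import tempfile
from pathlib import Path


def install_extension(jar_path: Path, rapidminer_home: Path) -> Path:
    """Copy an extension .jar into <rapidminer_home>/extensions and return the copy's path."""
    if jar_path.suffix.lower() != ".jar":
        raise ValueError(f"expected a .jar file, got {jar_path.name!r}")
    extensions_dir = rapidminer_home / "extensions"
    # Create the folder if it is missing (assumption: Studio scans it on startup).
    extensions_dir.mkdir(parents=True, exist_ok=True)
    target = extensions_dir / jar_path.name
    shutil.copy2(jar_path, target)
    return target


if __name__ == "__main__":
    # Dry run in a throwaway directory so nothing on the real system is touched.
    with tempfile.TemporaryDirectory() as tmp:
        fake_jar = Path(tmp) / "hanminer-placeholder.jar"  # hypothetical file name
        fake_jar.write_bytes(b"not a real jar")
        print(install_extension(fake_jar, Path(tmp) / ".RapidMiner"))
```

For a real install you would pass the path of the unzipped .jar and `Path.home() / ".RapidMiner"`, then restart RapidMiner Studio.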

Six new operators were added under the new extension folder "Text Miner".

A quick test on the news data looks reasonable.

Image: https://us.v-cdn.net/6030995/uploads/editor/uz/ojkthq4hg00s.png

Image: https://us.v-cdn.net/6030995/uploads/editor/g0/irwzey7lq9fd.png

<?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.5.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value="yhuang@rapidminer.com"/>
    <parameter key="process_duration_for_mail" value="1"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text_miner:read_text" compatibility="1.0.000" expanded="true" height="68" name="Read Text" width="90" x="112" y="34">
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="text" value="
这是默认的文本
每年到了这个时候，市场经济学家都会发布对未来12个月的详细宏观预测。令我自己都讶异的是，我正在为进行这项困难尝试的第五个十年画上句号，到目前为止离完美的成功预测还差得很远。经济以及市场的重大动荡可不会整整齐齐地把自己挤进一个自然年。"/>
        <parameter key="import_from_file" value="false"/>
      </operator>
      <operator activated="true" class="text_miner:tokenization" compatibility="1.0.000" expanded="true" height="68" name="Tokenization" width="90" x="313" y="34"/>
      <operator activated="true" class="text_miner:filter_stopwords" compatibility="1.0.000" expanded="true" height="68" name="Filtering" width="90" x="514" y="34"/>
      <operator activated="true" class="text_miner:word_count" compatibility="1.0.000" expanded="true" height="68" name="Word Count" width="90" x="782" y="34"/>
      <connect from_op="Read Text" from_port="output" to_op="Tokenization" to_port="text"/>
      <connect from_op="Tokenization" from_port="text" to_op="Filtering" to_port="text"/>
      <connect from_op="Filtering" from_port="text" to_op="Word Count" to_port="text"/>
      <connect from_op="Word Count" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
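
To illustrate what the last two stages of this process (Filtering and Word Count) do conceptually, here is a rough Python sketch. It is not how the extension is implemented. It assumes the text has already been segmented into words, which is the hard part that the Tokenization operator handles, and the stopword set and token list are hypothetical examples.

```python
from collections import Counter

# Hypothetical stopword set, for illustration only; real lists are much longer.
STOPWORDS = {"的", "以及", "了"}


def count_words(tokens, stopwords=STOPWORDS):
    """Drop stopwords and whitespace-only tokens, then count what remains."""
    return Counter(t for t in tokens if t.strip() and t not in stopwords)


if __name__ == "__main__":
    # Hand-segmented tokens (assumption: word segmentation was done elsewhere).
    tokens = ["经济", "以及", "市场", "的", "重大", "动荡", "经济"]
    for word, freq in count_words(tokens).most_common():
        print(word, freq)
```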

ruhaila · May 2023

Hi.

My apologies if I should have opened a new question. My question relates to the latest version of HanMiner, v1.0.3. I noticed that the READ TEXT operator is now named READ DOCUMENT.

My problem is that when I import from a file using this operator, the Chinese characters turn into unrecognizable symbols.

Image: https://us.v-cdn.net/6030995/uploads/editor/t4/apmd2b6737co.jpg

I have tried several ways:
1. I tried the different encodings listed and installed Chinese language support on my Windows PC, but it made no difference.

Image: https://us.v-cdn.net/6030995/uploads/editor/pc/l4wyvkeb8vdl.jpg

2. I imported the dataset as an example set and used DATA TO DOCUMENTS operator as below. However, I received an error.

Image: https://us.v-cdn.net/6030995/uploads/editor/jr/ng5rhdp966ir.jpg

3. I tried connecting the DATA TO DOCUMENTS operator to the READ DOCUMENT operator, but the input and output ports are not compatible.

Image: https://us.v-cdn.net/6030995/uploads/editor/bk/46dbt53jjhlp.jpg
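
Regarding item 1 above, one way to check outside RapidMiner which encoding a text file actually uses, and to rewrite it as UTF-8 before importing, is sketched below in Python. This is a heuristic, not a definitive detector. The candidate list (UTF-8, GB18030, Big5) and the function names are assumptions for illustration, and the order matters because some byte sequences decode without error under more than one encoding.

```python
import tempfile
from pathlib import Path

# Assumed candidates for Chinese text; UTF-8 is tried first because it is the strictest.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030", "big5")


def detect_encoding(raw: bytes, candidates=CANDIDATE_ENCODINGS) -> str:
    """Return the first candidate encoding that decodes the bytes without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings decoded the data")


def convert_to_utf8(src: Path, dst: Path) -> str:
    """Rewrite src as UTF-8 at dst and return the encoding that was detected."""
    raw = src.read_bytes()
    encoding = detect_encoding(raw)
    dst.write_bytes(raw.decode(encoding).encode("utf-8"))
    return encoding


if __name__ == "__main__":
    # Demo on a temporary GB18030 file so no real data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in.txt", Path(tmp) / "out.txt"
        src.write_bytes("这是默认的文本".encode("gb18030"))
        print(convert_to_utf8(src, dst))
        print(dst.read_text(encoding="utf-8"))
```

Since a wrong guess can still decode without raising an error, it is worth opening the converted file in a text editor to confirm the characters look right before importing it.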

Perhaps, @yyhuang can help shed some light here. Really appreciate it.

Thank you kindly.

jwpfau · May 2023

Hi,

Have you tried changing the encoding to UTF-8?

Greetings,
Jonas

ruhaila · May 2023

Hi Jonas,

Yes, I have, but still nothing.

ruhaila · May 2023

Thank you Jonas. That worked fine. Didn't think of macros here.
