Tokenize Chinese
xiaobo_sxb
Member Posts: 17 Contributor II
Does anybody know how to tokenize Chinese (or Japanese, Korean, etc.)? The current operator in the Text Processing extension works quite well for English but does not work for Chinese.
Steven
Answers
Please let us know your results!
I learned how to do text analytics after watching your video. It works nicely for English, but for Chinese I don't know how to tokenize. For example, here is one Chinese sentence: 这是一个关于如何实现文本分析的视频. The problem is that Chinese naturally has no blank or other non-letter character separating the words. Glossed word by word in English it reads: "这 (This) 是 (is) 一个 (a) 关于 (about) 如何 (how to) 实现 (realize) 文本 (text) 分析 (analytics) 的 () 视频 (video)"; of course the sentence would have to be re-ordered in English. The operator "Generate n-Grams (Characters)" does not work for Chinese: it creates a lot of n-gram items, but most of them are meaningless. I'm new to this area, so I'm not sure whether I did it correctly. I simply processed the document with Extract Content and Generate n-Grams. If you have a real example to share, that would be wonderful.
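To make the problem concrete outside RapidMiner, here is a tiny Python illustration (purely illustrative, not a RapidMiner process):

```python
sentence = "这是一个关于如何实现文本分析的视频"

# There is no whitespace or punctuation to split on, so a standard
# delimiter-based tokenizer sees the whole sentence as a single token:
print(sentence.split())  # ['这是一个关于如何实现文本分析的视频']

# Splitting into single characters loses the word boundaries instead,
# because many Chinese words span two or more characters:
print(list(sentence)[:5])  # ['这', '是', '一', '个', '关'] ...
```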
Try this:
Replace the question-mark characters with your Chinese text in the Create Document operator (that is just how the Chinese characters turned out in the RapidMiner XML). The process splits the document into all of its characters, produces a wordlist with those characters, and creates an ExampleSet with those characters as well.
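For reference, here is a rough Python equivalent of what that process does. This is only a sketch of the character-splitting idea, not the actual RapidMiner XML:

```python
from collections import Counter

document = "这是一句范文"   # stand-in for the Create Document text
tokens = list(document)      # split the document into single characters
wordlist = Counter(tokens)   # character "wordlist" with frequencies

print(tokens)    # ['这', '是', '一', '句', '范', '文']
print(wordlist)  # Counter({'这': 1, '是': 1, '一': 1, '句': 1, '范': 1, '文': 1})
```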
For example, the characters of the sentence 这是一句范文 ("This is an example sentence") group into words of different lengths:
这 -> This
是 -> is
一句 -> a
范文 -> example
Furthermore, there is no fixed rule for how many characters should be grouped together; it depends entirely on the context. Even the same characters can be grouped differently in different contexts.
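This context dependence is exactly what dictionary-based segmenters try to handle. As a hedged sketch of a practical workaround, assuming you can preprocess the text outside RapidMiner with the third-party Python library jieba (a Chinese word segmenter, not part of RapidMiner): segment the text into words, rejoin them with spaces, and then the standard non-letter tokenizer can split it like English.

```python
# A minimal sketch, assuming jieba is installed (pip install jieba).
import jieba

def presegment(text: str) -> str:
    """Insert spaces between the words jieba finds so that a standard
    whitespace/non-letter tokenizer can split the text afterwards."""
    return " ".join(jieba.cut(text))

print(presegment("这是一个关于如何实现文本分析的视频"))
# e.g. 这是 一个 关于 如何 实现 文本 分析 的 视频
# (the exact split may vary with the jieba version and dictionary)
```

The same idea works with any segmenter for Japanese or Korean; the key point is to restore word boundaries before the text reaches a delimiter-based tokenizer.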