Text processing
I'm wondering whether RapidMiner can tokenize Thai sentences into words. If not, how can I filter out Thai characters?
Thank you in advance!
Dtip
Best Answer
kayman Member Posts: 662 Unicorn
Not out of the box. It is, however, feasible to use Python as an external tokenizer and then continue your workflow in RapidMiner.
I've had similar issues with Japanese, which also has no spaces between words; using an external side step to do the tokenization and then going back into RM worked fine. There are quite a few tokenizers available for Python that cover these 'non-spaced' languages, so it's a matter of finding the one that best suits your needs.
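To give an idea of what such a side step could look like, here is a minimal sketch. It assumes you pass your texts through a Python function that inserts spaces between tokens, so that RM's normal space-based tokenizing can take over afterwards. The names `pre_tokenize` and `char_tokenizer` are made up for this illustration, and the character splitter is only a stand-in: for real Thai text you would pass in a proper word tokenizer (PyThaiNLP's `word_tokenize` is one candidate I have not tested here).

```python
from typing import Callable, Iterable, List


def pre_tokenize(texts: Iterable[str],
                 tokenizer: Callable[[str], List[str]]) -> List[str]:
    """Return each text with its tokens joined by single spaces.

    `tokenizer` is any callable that turns one string into a list of
    tokens; which library provides it is up to you (assumption).
    """
    result = []
    for text in texts:
        # Strip stray whitespace and drop tokens that end up empty.
        tokens = [tok.strip() for tok in tokenizer(text)]
        result.append(" ".join(tok for tok in tokens if tok))
    return result


def char_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer for demonstration only: one token per character."""
    return list(text)


if __name__ == "__main__":
    # Replace char_tokenizer with a real Thai word tokenizer in practice.
    print(pre_tokenize(["abc", "a b"], char_tokenizer))
```

The space-joined strings can then go back into RM as an ordinary text column, and the rest of the text processing workflow stays the same.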
An alternative is to use a dictionary, where you use RM to match the longest possible word at each position. This worked reasonably well for Japanese in some of our scenarios where the variety of words was fairly limited, but using Python turned out to be more reliable and (much) faster. I also don't know any Thai, so it might be more complex than for Japanese or Chinese, where this max match is an option.
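For reference, the max match idea can be sketched in a few lines of Python. This is a plain greedy longest-match, not what we actually ran in RM, and the function name `max_match` and the tiny Japanese word list are assumptions made up for the example. Characters that match no dictionary entry fall through as single-character tokens.

```python
from typing import List, Set


def max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy longest-match segmentation against a word list."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    words = {"東京", "東京都", "都", "に", "住む"}  # toy word list
    print(max_match("東京都に住む", words))  # ['東京都', 'に', '住む']
```

Because the match is greedy, it can pick a long word that leaves an unmatchable remainder, which is one reason this only held up for us when the vocabulary was fairly limited.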