JAPANESE Tokenizing
turutosiya
Member Posts: 2 Contributor I
Hi.
I'm a newbie at RapidMiner.
I'm trying to mine some web pages with "GetPage", "Extract Content" and "Process Documents".
It seems to work well for English pages, but for Japanese pages the tokenizer doesn't work well.
Is Japanese tokenization not supported?
Answers
Not really. As I'm not an expert on Japanese, I don't have a clue how we should do this. Japanese text doesn't use whitespace between words, does it?
How is it determined where a word ends?
Greetings,
Sebastian
You should also try the Text Processing > Transformation > Generate n-Grams (Characters) operator.
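To illustrate the idea behind character n-grams, here is a minimal Python sketch. It is not the RapidMiner operator's implementation, and the function name `char_ngrams` and its `max_n` parameter are made up for this example. It only shows why character n-grams can stand in for word tokens when the text has no whitespace: every run of 1 to `max_n` adjacent characters becomes a feature, so no word boundaries are needed.

```python
def char_ngrams(text, max_n=2):
    """Return all character n-grams of length 1..max_n.

    Hypothetical helper for illustration only. Whitespace is dropped
    before the n-grams are built, so n-grams may span a removed space.
    """
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the character sequence.
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


print(char_ngrams("日本語", max_n=2))
# ['日', '本', '語', '日本', '本語']
```

The trade-off is a larger and noisier feature set than real word tokens, since many of the n-grams cross word boundaries.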
Karl Bergerson
Seattle WA USA
karl.bergerson@gmail.com
You are very welcome to contribute if you can come up with a good algorithm for Japanese tokenization!
With kind regards,
Sebastian
It's been a really long time since I meant to start this project, but at last I have time to try.
I'm looking for a document that describes the API spec for the Tokenizer.
Does anyone know of one?
I'm trying to implement a JapaneseTokenizer that works with a morphological analysis engine such as ChaSen or MeCab.
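As a rough picture of how word boundaries can be found without whitespace, here is a toy Python sketch of greedy longest-match segmentation against a dictionary. This is a simplified stand-in, not how ChaSen or MeCab work internally and not the RapidMiner Tokenizer API. Real morphological analyzers use far larger dictionaries and statistical models. The function name `longest_match_tokenize` and the tiny `TOY_DICTIONARY` are assumptions made up for this example.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation (toy illustration).

    At each position, take the longest dictionary word that starts there.
    If nothing matches, emit a single character so the loop always advances.
    """
    # Longest word length bounds the search window (at least 1).
    max_len = max([1] + [len(word) for word in dictionary])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Hypothetical mini dictionary, for demonstration only.
TOY_DICTIONARY = {"私", "は", "日本語", "を", "勉強", "します"}

print(longest_match_tokenize("私は日本語を勉強します", TOY_DICTIONARY))
# ['私', 'は', '日本語', 'を', '勉強', 'します']
```

Greedy matching fails on ambiguous strings where the longest word is not the right one, which is one reason to delegate to an engine such as MeCab rather than hand-roll the segmentation.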