"tokenize BUG (Text processing)"
Hi, I am using the Tokenize operator (Text Processing)
with the 'specify characters' option.
The characters I specified are symbols and digits: (.:@/_",*$#!?^ ()<>+-%'"[]{}~`0123456789)
So my tokens come out as English words.
However, when I then filter stopwords (English) and apply stemming (Porter),
I think I have run into a bug:
the words in my results are not stemmed correctly.
For example, "apply" and "applies" stay separate instead of being combined,
and, even stranger, a new keyword "appli" is generated that never existed in the original documents.
However, when I use the tokenizer in 'non letters' mode,
the stemming is correct and both words are grouped under the keyword 'apply'.
Is this a bug, or something else?
How can I resolve this problem?
I would prefer to keep the 'specify characters' option because I want some special characters to be retained.
Thanks.
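
For reference, here is a minimal Python sketch of the setup described above, to show where a token like "appli" can come from. It is not RapidMiner's implementation. The helper names `tokenize`, `has_vowel` and `porter_step1_partial` are made up for illustration, and only two rules of the published Porter algorithm are included (step 1a plural handling and step 1c "y" to "i"), with a simplified vowel check. Under those assumptions, both "apply" and "applies" reduce to "appli", so the stem itself looks like normal Porter output. The sketch does not explain why the two tokenizer modes give different results.

```python
import re

# Characters quoted in the post, used here as split points.
# Assumption: this roughly mirrors what the 'specify characters' mode does.
SPLIT_CHARS = ".:@/_\",*$#!?^ ()<>+-%'\"[]{}~`0123456789"

VOWELS = set("aeiou")


def tokenize(text):
    """Split text on any run of characters from SPLIT_CHARS; drop empty tokens."""
    pattern = "[" + re.escape(SPLIT_CHARS) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


def has_vowel(stem):
    """Simplified vowel test: a, e, i, o, u, or 'y' after a non-vowel letter."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def porter_step1_partial(word):
    """Apply only Porter step 1a and step 1c (hypothetical, partial helper)."""
    # Step 1a: plural suffixes.
    if word.endswith("sses"):
        word = word[:-2]          # "caresses" becomes "caress"
    elif word.endswith("ies"):
        word = word[:-2]          # "applies" becomes "appli"
    elif word.endswith("ss"):
        pass                      # "caress" stays as it is
    elif word.endswith("s"):
        word = word[:-1]          # "cats" becomes "cat"
    # Step 1c: a final "y" becomes "i" when the rest of the word has a vowel.
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"    # "apply" becomes "appli"
    return word


tokens = tokenize("she applies, i apply!")
stems = [porter_step1_partial(tok) for tok in tokens]
print(list(zip(tokens, stems)))
```

In this sketch "apply" and "applies" end up under the same stem, so one thing worth checking in the real process is whether the tokens reaching the stemmer differ between the two modes, for example through leftover characters or letter case.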