Web Mining for Other Languages
krsnewwave
Member Posts: 1 Learner III
Hi!
I have some Japanese pages in my analysis, and I noticed that the "Extract Content" operator doesn't handle texts in encodings other than UTF-8 well. Is there any way to change how it handles encoding?
(EDIT: While this question is hanging, I think I'm going to try the complement instead: removing all HTML tags and the <script> regions. It seems to work okay so far.)
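For reference, here is a minimal sketch of that workaround outside RapidMiner: decode the page bytes (trying a few common Japanese encodings as fallbacks), drop the <script>/<style> regions, then strip the remaining tags. The URL, the encoding list, and the regex-based approach are my own assumptions for illustration, not part of the Extract Content operator.

```python
import re
import urllib.request

def fetch_and_strip(url):
    """Download a page, decode it (trying common Japanese encodings),
    then drop <script>/<style> regions and remaining HTML tags."""
    raw = urllib.request.urlopen(url).read()
    # Try UTF-8 first, then common Japanese encodings as fallbacks.
    for enc in ("utf-8", "shift_jis", "euc-jp", "iso-2022-jp"):
        try:
            html = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        html = raw.decode("utf-8", errors="replace")
    # Remove <script> and <style> blocks entirely, then strip other tags.
    html = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # Collapse leftover whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

# Example usage (hypothetical URL):
# print(fetch_and_strip("https://example.com")[:500])
```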
Answers
I know people who do this sort of thing, and they say that a problem with Chinese is tokenising the sentences, as there are no spaces to separate the words. Check this out: http://www.foreverastudent.com/2012/03/chinese-word-frequency-list-news.html
It is possible, but not easy!
Good luck.
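To illustrate the segmentation problem mentioned above, here is a minimal sketch of Chinese word segmentation. It assumes the third-party jieba package is installed; neither the library nor the sample sentence comes from this thread, they are just for illustration.

```python
# Minimal word-segmentation sketch, assuming "jieba" is installed
# (pip install jieba). Chinese text has no spaces between words,
# so a dictionary-based segmenter is used to recover word boundaries.
import jieba

sentence = "我爱自然语言处理"  # "I love natural language processing", no spaces
words = jieba.lcut(sentence)   # segment the sentence into a list of words
print(words)                   # e.g. ['我', '爱', '自然语言', '处理']
```

Japanese has the same issue; a morphological analyser would play the equivalent role there before any word-frequency analysis.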