
Why is UTF-8 not working?

heron_oliveira Member Posts: 6 Learner I
Today I converted a PDF to TXT, and I'm trying to analyse the frequency of some terms in the text. Although the TXT file is in UTF-8, and I have already set the program's encoding to the default (SYSTEM) and to 'UTF-8' before tokenizing and generating n-grams, it keeps showing incorrect words. For example, the word should have been 'abrangência' instead of 'abrangãºncia'.
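
Garbling of this kind usually appears when UTF-8 bytes are decoded with a single-byte encoding. The sketch below reproduces the effect in Python; it assumes Windows-1252 as the mismatched encoding, since the thread does not say which system default was in effect, and the variable names are illustrative only.

```python
# Minimal sketch: how a UTF-8 word turns into mojibake.
word = "abrangência"

# The file on disk holds UTF-8 bytes; "ê" is stored as two bytes (0xC3 0xAA).
raw = word.encode("utf-8")

# Assumption: the reader decodes those bytes as Windows-1252 instead of UTF-8,
# so each of the two bytes becomes its own character.
garbled = raw.decode("cp1252")
print(garbled)          # abrangÃªncia

# A later lowercasing step (such as a case transformation) changes "Ã" to "ã".
print(garbled.lower())  # abrangãªncia

# Undoing the wrong decode recovers the original word.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)         # abrangência
```

The lowercased result closely resembles the word reported above, which suggests the text was read with the wrong encoding before tokenizing rather than stored incorrectly.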

Best Answer

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Solution Accepted
    Hi there,
    which operator do you use to read the text file? It should have its own encoding setting as well.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
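
The point of the answer is that the encoding has to be chosen at the step that reads the file, not only globally. The following is a Python analogy of that idea, not RapidMiner's API: the `read_text` helper and the file name `sample.txt` are hypothetical, and Windows-1252 again stands in for an unknown system default.

```python
import tempfile
from pathlib import Path


def read_text(path: Path, encoding: str) -> str:
    """Read a text file, decoding its bytes with the encoding given by the caller."""
    return path.read_text(encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # hypothetical file name
    sample.write_text("abrangência", encoding="utf-8")

    # Encoding set explicitly at the reading step: the text comes back intact.
    correct = read_text(sample, "utf-8")

    # Assumed stand-in for a mismatched default: the same bytes are misread.
    wrong = read_text(sample, "cp1252")

print(correct)  # abrangência
print(wrong)    # abrangÃªncia
```

Every step after the read (tokenizing, case transformation, n-grams) only sees the already decoded text, so a wrong choice at this point cannot be corrected by a later setting.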

Answers

  • heron_oliveira Member Posts: 6 Learner I
    My txt file has the correct words; the problem only happens when I run operators in RapidMiner. I'm using operators for tokenizing, Transform Cases, Generate n-Grams, Filter Tokens and Filter StopWords. But the problem starts at the first operator, which is Tokenize...
  • heron_oliveira Member Posts: 6 Learner I
    I would also like to know what format the stop words list must have. Since there is no Portuguese stop words operator, I made a list document, but I don't know whether it accepts a list format or whether it should be a dictionary or something else.
  • heron_oliveira Member Posts: 6 Learner I
    Exactly, I was changing the encoding in Settings > Preferences, but in fact I should have done it in the operator settings. Thanks!
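
The stop word question is not answered in the thread. A common format for a custom stop word list is a plain text file with one word per line, saved in the same encoding the reading step expects. The sketch below assumes that format and UTF-8; the word sample, the file name `stopwords_pt.txt` and the helper functions are illustrative and not part of RapidMiner.

```python
import tempfile
from pathlib import Path

# Small illustrative sample of Portuguese stop words, not a complete list.
STOP_WORDS = ["de", "a", "o", "que", "e", "não"]


def write_stop_word_file(path: Path, words: list[str]) -> None:
    """Write one stop word per line, explicitly encoded as UTF-8."""
    path.write_text("\n".join(words) + "\n", encoding="utf-8")


def load_stop_words(path: Path) -> set[str]:
    """Load the list back with the same encoding, skipping blank lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(tokens: list[str], stops: set[str]) -> list[str]:
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stops]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stopwords_pt.txt"  # hypothetical file name
    write_stop_word_file(path, STOP_WORDS)
    loaded = load_stop_words(path)

tokens = ["a", "abrangência", "de", "não", "cobertura"]
kept = filter_tokens(tokens, loaded)
print(kept)  # ['abrangência', 'cobertura']
```

Accented entries such as "não" only match if the list and the document are decoded with the same encoding, so the encoding fix above applies to the stop word file as well.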