Closest encoding to utf8mb4

Robi_Me Member Posts: 32 Maven
I am working with social media data and all those emojis are driving me crazy. When I import them, they get converted to the system encoding and turn into a bunch of squiggles. Which encoding is closest to utf8mb4, so that I can preserve them when reading from a CSV?

Best Answer

  • Robi_Me Member Posts: 32 Maven
    Solution Accepted
    @jwpfau When I import into the DB, it fails saying the character is not UTF8, with the error message: Incorrect string value: '\xE2 \x94 \x82....'

    These were basically all of the emojis being rejected. I was under the impression that I needed to set the encoding inside RapidMiner; however, the change was needed on the DB. Changing the free-text field to TEXT and setting its encoding to utf8mb4 sorted the issue out.
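
To illustrate why the column encoding matters, here is a minimal Python sketch that checks whether a string contains characters taking more than 3 bytes in UTF-8, which is the limit of MySQL's older utf8 type. The function names and sample strings are made up for illustration and are not part of RapidMiner or MySQL.

```python
def utf8_byte_length(char: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(char.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in the text needs 4 bytes in UTF-8.

    Such characters (most emojis among them) do not fit in a column
    that is limited to 3 bytes per character.
    """
    return any(utf8_byte_length(ch) > 3 for ch in text)


# Hypothetical sample strings standing in for social media text.
print(needs_utf8mb4("nice weather 😀"))
print(needs_utf8mb4("café │"))
```

A check like this could be run over the free-text field before loading, to see which rows would depend on a utf8mb4 column.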

Answers

  • kayman Member Posts: 662 Unicorn
    If your CSV is UTF-8, you should be able to read it just by importing it as UTF-8. I've never had any issues with this, as long as the source was properly encoded.
  • Robi_Me Member Posts: 32 Maven
    Nope, all emojis and non-ASCII characters are turning into their base encoding. Any idea which encoding is closest to UTF8MB4?
  • jwpfau Employee-RapidMiner, Member Posts: 303 RM Engineering
    Hi,

    UTF8MB4 is a workaround for the broken UTF8 type in MySQL, which only supports characters of up to 3 bytes.
    In the CSV export it should be just regular UTF-8.

    Maybe the selected RapidMiner Studio font doesn't contain all the smileys and is displaying squares instead?

    Greetings,
    Jonas
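
As a rough illustration of the point that the CSV itself is plain UTF-8, the Python sketch below decodes the same CSV bytes twice: once as UTF-8 and once as cp1252, which is assumed here as a stand-in for a typical Windows system encoding. The sample row and the helper name `read_csv_bytes` are hypothetical.

```python
import csv
import io

# Hypothetical sample rows standing in for a social media export.
rows = [["id", "text"], ["1", "great day 😀"]]

# Serialize the rows to CSV in memory, then encode the result as UTF-8 bytes.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
raw_bytes = buffer.getvalue().encode("utf-8")


def read_csv_bytes(data: bytes, encoding: str) -> list[list[str]]:
    """Decode CSV bytes with the given encoding and parse them into rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


# Decoding with the encoding the file was written in preserves the emoji.
as_utf8 = read_csv_bytes(raw_bytes, "utf-8")

# Decoding the same bytes with a single-byte encoding yields garbled text.
as_cp1252 = read_csv_bytes(raw_bytes, "cp1252")

print(as_utf8[1][1])
print(as_cp1252[1][1])
```

If the text only looks wrong in the second case, the file is fine and the reader's encoding setting is the thing to change.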

