Counting Emojis in Text Mining
Hello - there was a good question about how to do text mining with unusual characters such as emojis. I like to do a little "ETL jujitsu" when I work with text data like this: temporarily convert the text to UTF-8 hex (URL encoding) so that each emoji becomes a unique, easily parsed token, do the replacement, and then convert back. Here's the idea:
1. Import your example set of text data:
2. Get your master set of emojis (I got them from here) and then put them into an Excel doc or whatever. I like putting the Unicode in brackets so I can find it easily + tokenize if desired (see "Unicode RM" column):
3. Use Encode URL to convert your text to UTF-8 hex, replace each UTF-8 hex sequence with its Unicode token (or whatever you like) from your Excel dictionary, and then convert back:
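For readers who want to see the same encode / replace / decode idea outside of a process, here is a minimal Python sketch. It is not the attached process, just an illustration under some assumptions: the two-entry `EMOJI_TOKENS` dictionary stands in for the Excel dictionary, the bracketed `[U+1F601]` token format stands in for the "Unicode RM" column, and the function name `replace_emojis` is made up for this example.

```python
from urllib.parse import quote, unquote

# Hypothetical mini-dictionary standing in for the Excel sheet:
# emoji character -> bracketed Unicode token.
EMOJI_TOKENS = {
    "\U0001F601": "[U+1F601]",
    "\U0001F64F": "[U+1F64F]",
}

# Key the dictionary by the percent-encoded UTF-8 hex of each emoji,
# e.g. U+1F601 becomes "%F0%9F%98%81".
HEX_TO_TOKEN = {quote(emoji, safe=""): token for emoji, token in EMOJI_TOKENS.items()}


def replace_emojis(text):
    """Encode text to UTF-8 hex, swap known emoji sequences for tokens, decode back."""
    encoded = quote(text, safe="")
    for hex_key, token in HEX_TO_TOKEN.items():
        # The token is encoded too, so the whole string stays percent-encoded
        # until the final decode.
        encoded = encoded.replace(hex_key, quote(token, safe=""))
    return unquote(encoded)


if __name__ == "__main__":
    print(replace_emojis("great day \U0001F601!"))
```

Because a literal `%` in the input is itself encoded as `%25`, the hex keys can only match real emoji byte sequences, which is what makes the encoded form safe to search and replace in.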
Voilà - perfect conversion (well, not bad anyway!)
If you want to put that in a process that counts emojis, just add on some text mining using Process Documents From Data and join back with the original data set:
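The counting step can be sketched in Python as well. This is only an analogy to tokenizing with Process Documents From Data and joining back, not a reproduction of it: the `(id, text)` row layout, the `[U+XXXXX]` token format, and the name `count_emoji_tokens` are all assumptions made for the example.

```python
import re
from collections import Counter

# Matches bracketed tokens such as [U+1F601]; the format is an assumption
# about how the dictionary's token column is written.
TOKEN_PATTERN = re.compile(r"\[U\+[0-9A-F]{4,6}\]")


def count_emoji_tokens(rows):
    """Count emoji tokens per row and keep the original id and text alongside."""
    results = []
    for row_id, text in rows:
        counts = Counter(TOKEN_PATTERN.findall(text))
        # Carrying id and text along plays the role of the join back
        # to the original data set.
        results.append({"id": row_id, "text": text, "emoji_counts": dict(counts)})
    return results


if __name__ == "__main__":
    rows = [
        (1, "so happy [U+1F601] [U+1F601] thanks [U+1F64F]"),
        (2, "no emojis in this one"),
    ]
    for result in count_emoji_tokens(rows):
        print(result["id"], result["emoji_counts"])
```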
Thanks to user @gjagiello for the data and the inspiration!
Scott
[process attached for those who want to take a look]
Comments
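If the Unicode code points in the given example range from 1F601 to, say, 1F64F, then you can get this in one go using the Replace operator and entering [\x{1F601}-\x{1F64F}] -> replace with somespecialthingy. (Assuming the operator uses Java-style regular expressions: \u takes exactly four hex digits there, so code points above FFFF need the \x{...} form.)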
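For comparison, the same range replacement can be sketched in Python. Note that Python's `\u` escape also takes exactly four hex digits, so code points above FFFF are written with the eight-digit `\U` form. The function name `replace_emoji_range` is made up for this example, and the replacement text is the placeholder from the comment above.

```python
import re

# Character class covering U+1F601 through U+1F64F.
EMOJI_RANGE = re.compile("[\U0001F601-\U0001F64F]")


def replace_emoji_range(text, replacement="somespecialthingy"):
    """Replace every character in the U+1F601..U+1F64F range with one placeholder."""
    return EMOJI_RANGE.sub(replacement, text)


if __name__ == "__main__":
    print(replace_emoji_range("\U0001F601 hi \U0001F64F"))
```

This collapses every emoji in the range into a single token, so it suits counting emojis overall rather than counting each emoji separately.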