
Tokenize operator issue - help request

amitdamitd Member, University Professor Posts: 49 Maven
edited February 2022 in Help
I have to process some documents where a word followed by a double exclamation mark (!!) should be kept as a single token (e.g., sentence!! as one token, not 'sentence' and '!!' separately). Similarly, the smiley character : ) is expected to be a separate token. When I use the non-letters mode in Tokenize, the words are extracted, but not the way I would like. When I use mode = regular expression with the expression [a-zA-Z!:)]+, it does not work at all. I tested the regular expression in the expression builder, and it matches correctly when each document text is tested in its preview. However, the output of the process ends up being blank, and I have no clue why. I have attached the two processes. Can someone please help?

The expected output would be (counts not shown).
: ) (I have added a space between the colon and the parenthesis; otherwise the editor converts it to a smiley emoji like this :)
a
all
another
here
is
last
new
of
sentence
sentence!! 
sentences
this
yet


Best Answer

  • amitdamitd Member, University Professor Posts: 49 Maven
    Solution Accepted
    I figured out the issue. In regular expression mode, the expression has to match the characters that separate tokens, not the tokens we expect to keep. So the regular expression should be [ .,]+ and then it works fine.
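To illustrate the difference between a pattern that matches separators and one that matches the tokens themselves, here is a minimal sketch using Python's re module. It is only an analogy and not the Tokenize operator itself. The sample text and the helper names are made up for illustration, since the attached documents are not shown in this thread.

```python
import re

# Hypothetical sample text; the original documents are not shown in the thread.
SAMPLE = "This is a sentence!! Here is another sentence, yet another :) last of all."


def tokenize_by_separators(text, separator_pattern):
    """Split text wherever separator_pattern matches and drop empty pieces."""
    return [piece for piece in re.split(separator_pattern, text) if piece]


def tokenize_by_matching(text, token_pattern):
    """Return every substring that matches token_pattern."""
    return re.findall(token_pattern, text)


# Separator pattern from the accepted answer: split on spaces, periods, commas.
split_tokens = tokenize_by_separators(SAMPLE, r"[ .,]+")

# Token pattern from the question: only correct if the tool *matches* tokens.
match_tokens = tokenize_by_matching(SAMPLE, r"[a-zA-Z!:)]+")

# The token pattern misused as a separator pattern: only the gaps are left.
leftovers = tokenize_by_separators(SAMPLE, r"[a-zA-Z!:)]+")

print(split_tokens)
print(leftovers)
```

On this sample, splitting on [ .,]+ keeps sentence!! and :) as whole tokens. Feeding the token pattern to a splitter leaves only spaces and punctuation, which may be why the process output looked blank.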