
"Text Mining On rapid miner (Dealing with Smileys)"

maccten Member Posts: 28 Contributor II
edited June 2019 in Help
Hi,

I have a collection of emails that I want to tokenize and then stem with a custom dictionary.
The emails contain many smiley faces and other unorthodox symbols used to convey emotion.
Is there a way to use the Stem (Dictionary) operator to handle these?
For example, my dictionary text file contains an entry such as Smile: :)
I have tried this, but the process keeps failing. Could you suggest any changes I could make to my dictionary?

Thank you for your time

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    Your Tokenize operator is probably configured to split on anything other than letters and numbers, so the punctuation marks that make up smileys are discarded. Please make sure you define suitable tokenization rules. If that does not help, please post your process XML as described in the post linked from my signature.

    Best regards,
    Marius
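
The effect described above can be reproduced outside RapidMiner. The following is a minimal Python sketch, not the Tokenize operator itself: the sample sentence is made up, and the regular expression is only an assumption about what "cut on anything but letters and numbers" amounts to.

```python
import re

# Hypothetical sample text, not taken from the emails in question.
text = "Thanks for the update :) see you soon ;-)"

# Splitting on every run of characters that are not letters or digits
# throws the punctuation away, so the smileys never become tokens.
alnum_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# Splitting on whitespace keeps each smiley intact as its own token.
whitespace_tokens = text.split()

print(alnum_tokens)
print(whitespace_tokens)
```

With whitespace splitting, `:)` and `;-)` survive as tokens and can be matched by a dictionary later. With the letters-and-digits rule they are gone before any dictionary is applied.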
  • maccten Member Posts: 28 Contributor II
    Hi Marius

    Sorry to bring this topic up again. I have a quick question about RapidMiner on the same topic.
    I have around 1000 documents, each of which contains text speak.
    I have created a dictionary for text speak and smileys. It sits in a MySQL table, with each entry in the format shown below:
    :)
    Smiling, Happy
    Is there an operator in RapidMiner that will find exact matches of the text speak (the first column of my table) and convert them to the normalized English in the second column?

    Thanks for your time
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi, the easiest way to do that is with the Replace (Dictionary) operator. Place it in your process before the Process Documents operator.
    Connect the example set containing the documents to the exa input port. Then load the dictionary with Read Database and connect it to the dic input port. Use the from_attribute parameter to select the first column as the search string and the to_attribute parameter to select the other column as the replacement string.

    Best regards,
    Marius
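
To make the idea of a search column and a replacement column concrete, here is a rough Python analogue. It is a sketch under assumptions, not the Replace (Dictionary) operator: the `dictionary_rows` list stands in for the two columns of the MySQL table, and the function name `replace_with_dictionary` is hypothetical.

```python
# Hypothetical rows standing in for the two-column dictionary table:
# first element = search string, second element = replacement string.
dictionary_rows = [
    (":)", "Smiling, Happy"),
    ("U", "YOU"),
]


def replace_with_dictionary(text, rows):
    """Replace every occurrence of each search string with its replacement."""
    for search, replacement in rows:
        text = text.replace(search, replacement)
    return text


print(replace_with_dictionary("See U later :)", dictionary_rows))
# prints "See YOU later Smiling, Happy"
```

Note that `str.replace` matches anywhere in the text, including inside longer words, so short search strings need care.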
  • maccten Member Posts: 28 Contributor II
    Hi Marius

    This works well, but unfortunately, if a sentence contains the word URGENT and the dictionary specifies that U maps to YOU, I get YOUrgent :)
    Is there a way to match only whole words?
    Would it make more sense to tokenize the documents first, using a space as the delimiter?

    Thanks for your time, Marius.
    It's really appreciated.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    You can also tokenize first. In that case, however, smileys attached directly to a word will not be replaced. After tokenization you can use the Stem (Dictionary) operator.

    Best regards,
    Marius
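
The trade-off between the two approaches in this thread can be sketched in Python. This is an illustration under assumptions rather than the behaviour of any RapidMiner operator: the function names and the sample sentence are hypothetical, and the dictionary entries are the ones mentioned in the thread.

```python
# Dictionary entries mentioned in the thread.
dictionary = {"U": "YOU", ":)": "Smiling, Happy"}


def replace_substrings(text, mapping):
    """Replace matches anywhere in the text, even inside longer words."""
    for search, replacement in mapping.items():
        text = text.replace(search, replacement)
    return text


def replace_tokens(text, mapping):
    """Split on whitespace and replace only tokens that match an entry exactly."""
    return " ".join(mapping.get(token, token) for token in text.split())


# Hypothetical sentence with a standalone smiley and one attached to a word.
sentence = "URGENT can U call me :) thanks:)"

print(replace_substrings(sentence, dictionary))
print(replace_tokens(sentence, dictionary))
```

Token-wise replacement leaves URGENT alone and still converts the standalone `U` and `:)`, but the smiley in `thanks:)` is left untouched because it is not a separate token. Substring replacement catches that smiley but also rewrites the `U` inside URGENT. The sketch also collapses runs of whitespace into single spaces, which may or may not matter for the documents at hand.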