
Tokenization vs N-grams

HeikoeWin786 Member Posts: 64 Contributor I
Hello guys,

I am doing sentiment analysis in RapidMiner. While building the word vector, I see that there are two approaches: tokenization (by non-letters) and generating n-grams. I am not sure about the main difference between these two operators and their best use cases. Can someone explain how they work differently in RapidMiner? For sentiment analysis, which approach would you suggest: tokenization or n-grams?

Thanks and regards,
Heikoe

Best Answer

  • kayman Member Posts: 662 Unicorn
    Solution Accepted
    n-grams are sequences of successive tokens (words, in this case), so the two are related rather than competing options: you tokenize first and then generate the n-grams from those tokens. Using n-grams never hurts an NLP workflow, so just use them if your workflow can handle it. That way you have both your single tokens (words) and the n-grams available for your training.

    Bi-grams will do fine for sentiment; anything beyond that typically doesn't add much value.
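
    To make the relationship concrete, here is a minimal Python sketch (RapidMiner itself is configured through operators, not code) of tokenizing by non-letters and then generating bi-grams from those tokens. The sample sentence and the underscore join are illustrative choices, not RapidMiner defaults:

        import re

        def tokenize_non_letters(text):
            # Split on any run of non-letter characters and lowercase,
            # roughly what Tokenize in "non letters" mode does
            return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

        def ngrams(tokens, n=2):
            # Build successive n-grams (bi-grams for n=2) from the token list
            return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

        tokens = tokenize_non_letters("The movie was not good at all")
        features = tokens + ngrams(tokens, 2)  # single tokens plus bi-grams
        print(features)
        # ['the', 'movie', 'was', 'not', 'good', 'at', 'all',
        #  'the_movie', 'movie_was', 'was_not', 'not_good', 'good_at', 'at_all']

    Note how the bi-gram not_good captures a negation that the single tokens not and good miss on their own, which is exactly why bi-grams help sentiment models.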

Answers

  • HeikoeWin786HeikoeWin786 Member Posts: 64 Contributor I
    @kayman

    Thanks for your clarification here.
    Meaning to say, we use bi-grams as part of the data pre-processing?
    i.e. inside the Process Documents from Data operator, we put Generate n-Grams as part of the pre-processing together with Tokenize, Stem (Porter), etc.? (A sketch of that order follows below.)

    Thanks and regards,
    Heikoe
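
    Regarding the order asked about above, here is a hypothetical Python sketch (again, not RapidMiner itself) of that operator chain, using NLTK's PorterStemmer as a stand-in for the Stem (Porter) operator; the sample text is made up:

        import re
        from nltk.stem import PorterStemmer  # assumes nltk is installed

        stemmer = PorterStemmer()

        def preprocess(text, n=2):
            # 1. Tokenize (non letters): split on anything that isn't a letter
            tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
            # 2. Stem (Porter): reduce each token to its stem
            stems = [stemmer.stem(t) for t in tokens]
            # 3. Generate n-Grams (Terms) last, so bi-grams are built from stems
            bigrams = ["_".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]
            return stems + bigrams

        print(preprocess("The acting was not very convincing"))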