Generate n-Grams (Terms) operator
hi,
I would like to know how the n-grams are generated. I noticed that some words are grouped together as n-grams (terms) and others are not (single words). How does it decide which terms to group together and which not? Many of the most frequently occurring terms have no n-gram groupings...
Answers
This is how n-grams work if you set the length to 2. From the sentence "RapidMiner Studio is the best." it will make the following combinations:
RapidMiner_Studio
Studio_is
is_the
the_best
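As a rough sketch of that pairing step (plain Python for illustration, not RapidMiner's actual implementation; the function name `generate_ngrams` and its `max_length` argument are made up for this example), adjacent tokens are joined with an underscore:

```python
def generate_ngrams(tokens, max_length=2):
    """Join every run of 2..max_length adjacent tokens with '_'."""
    ngrams = []
    for n in range(2, max_length + 1):
        # Slide a window of size n across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append("_".join(tokens[i:i + n]))
    return ngrams


# Tokenized sentence, with the trailing period already removed.
tokens = "RapidMiner Studio is the best".split()
bigrams = generate_ngrams(tokens, max_length=2)
print(bigrams)
# ['RapidMiner_Studio', 'Studio_is', 'is_the', 'the_best']
```

Every adjacent pair is produced at this stage; which ones survive depends on the steps that run before and after it in the process.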
Assuming your corpus of documents is about RapidMiner Studio reviews and you have TF-IDF set as your word vector creation, it will likely give "is_the" a very low value and "RapidMiner_Studio" and "the_best" higher values. Of course, if you have stemming, filtering, and pruning set, it might drop "is_the" out completely, and that's probably what's happening with your process.
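To illustrate why a filter placed before the n-gram step changes which pairs appear, here is a small plain-Python sketch. It is not RapidMiner code: the stopword set and the helper names `filter_stopwords` and `bigrams_of` are assumptions made up for this example.

```python
# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"is", "the"}


def filter_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def bigrams_of(tokens):
    """Join each pair of adjacent tokens with '_'."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]


tokens = "RapidMiner Studio is the best".split()

# n-grams built on the raw tokens: every adjacent pair shows up.
without_filter = bigrams_of(tokens)

# n-grams built after stopword filtering: "is" and "the" are gone,
# so no pair containing them can be formed.
with_filter = bigrams_of(filter_stopwords(tokens))

print(without_filter)  # ['RapidMiner_Studio', 'Studio_is', 'is_the', 'the_best']
print(with_filter)     # ['RapidMiner_Studio', 'Studio_best']
```

In this sketch a frequent word that is removed as a stopword never gets an n-gram grouping, which would match frequent terms showing up without groupings.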
Well, inside the Process Documents operator I had tokenize, stemming, stopwords, and the n-gram operator, so this might have been the cause...