"Bigram Document Vector"
Hi,
I want to create a document vector consisting only of bigrams.
For this, I first save the wordlist using the following operators:
TextInput
StringTokenizer
and then I am using
TextInput
StringTokenizer
TermNgramGenerator
StopWordFilterFile (using the previously saved wordlist)
Is there any better way of doing this?
Answers
I guess there is. Why don't you just use it this way:
TextInput
StringTokenizer
TermNgramGenerator
The resulting vector will contain only the bigrams, since the vector is built from the tokens generated by all inner operators. If a complete word is not among the remaining tokens, it will not be part of the vector.
Or did I misunderstand you completely?
Greetings,
Sebastian
I tried using only
TextInput
StringTokenizer
TermNGramGenerator
The problem I am facing is that, along with the bigrams, unigrams also end up in the document vector. I want only bigrams, not unigrams. To prevent this, I have to use the StopWordFilter to remove the unigrams.
Please let me know if I can achieve this in a better way.
I didn't know what a bigram was, so I looked it up on Wikipedia and found this: http://en.wikipedia.org/wiki/N-gram
What is the context of your application?
For me, a bigram should be composed of a sequence of words.
E.g. for "the dog smelled like a skunk" the bigrams should be xx_the, the_dog, dog_smelled, smelled_like, like_a, a_skunk, skunk_xx.
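To make the expected output concrete, here is a minimal Python sketch of word-level bigrams with boundary padding. It is only an illustration and not how RapidMiner implements this: the function name `word_bigrams`, the whitespace tokenization, and the `xx` padding token are assumptions taken from the example sentence.

```python
def word_bigrams(text, pad="xx"):
    """Return word bigrams joined by '_', padded at both ends with `pad`."""
    # Assumption: simple lowercase whitespace tokenization is good enough here.
    tokens = text.lower().split()
    padded = [pad] + tokens + [pad]
    # Pair each token with its right-hand neighbour.
    return [f"{left}_{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    print(word_bigrams("the dog smelled like a skunk"))
    # ['xx_the', 'the_dog', 'dog_smelled', 'smelled_like', 'like_a', 'a_skunk', 'skunk_xx']
```

Note that no single word appears in the result, which is the behaviour I am after for the document vector.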
With RapidMiner 5.0, the result should look exactly like what you are expecting. Otherwise there is no option to change this, but you could filter out the undesired results using the example filter.
You can specify a regular expression that filters the attributes by their names.
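As a rough sketch of the name-based filtering idea, outside of RapidMiner: the Python snippet below keeps only attribute names that look like bigrams. It assumes that n-gram attributes are named by joining terms with an underscore and that single terms contain no underscore; the pattern and the helper `keep_bigram_attributes` are hypothetical and would need adjusting to the actual attribute names.

```python
import re

# Assumption: a bigram attribute name is exactly two non-empty terms joined by "_".
BIGRAM_NAME = re.compile(r"[^_]+_[^_]+")


def keep_bigram_attributes(names):
    """Return only the names that consist of exactly two '_'-joined terms."""
    return [name for name in names if BIGRAM_NAME.fullmatch(name)]


if __name__ == "__main__":
    attributes = ["the", "dog", "the_dog", "dog_smelled", "the_dog_smelled"]
    print(keep_bigram_attributes(attributes))
    # ['the_dog', 'dog_smelled']
```

A similar expression could be tried in the filter's regular expression parameter, with the caveat that terms which themselves contain an underscore would be misclassified.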
Greetings,
Sebastian