Filter tokens (by Pos Tags) without generating n-grams
HeikoeWin786
Member Posts: 64 Contributor I
in Help
Dear all,
I am running Process Documents from Data with the following operators: Tokenize, Transform Cases, Filter Tokens (by Length), Filter Stopwords (English), Stem (Porter), and Filter Tokens (by POS Tags).
The process takes very long to run, almost 6 hours, and I am not sure whether I am doing something incorrectly.
Is it OK to use Filter Tokens (by POS Tags) without generating n-grams, or must the n-grams be generated first?
Thanks
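For reference, the operator chain above (minus the POS filter) can be sketched in plain Python. The stop-word list and the suffix stripper below are toy stand-ins for the real English stop-word list and the real Porter stemmer, so the exact outputs will differ from RapidMiner's:

```python
import re

# Tiny sample stop-word list (a real English list has ~150+ entries).
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def toy_stem(token):
    """Very rough Porter-style suffix stripping (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def preprocess(text, min_len=3, max_len=25):
    tokens = re.findall(r"[a-zA-Z]+", text)                  # Tokenize
    tokens = [t.lower() for t in tokens]                     # Transform Cases
    tokens = [t for t in tokens
              if min_len <= len(t) <= max_len]               # Filter Tokens (by Length)
    tokens = [t for t in tokens if t not in STOPWORDS]       # Filter Stopwords (English)
    return [toy_stem(t) for t in tokens]                     # Stem (Porter-like)

print(preprocess("The cats were chasing mice in the garden"))
# → ['cat', 'were', 'chas', 'mice', 'garden']
```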
Best Answer
jacobcybulski Member, University Professor Posts: 391 Unicorn
I think this is happening because the Porter stemmer (like Snowball) is algorithmic: its output is not made of dictionary words, so it carries no part-of-speech information. For the POS filter to work, you may need a dictionary-based stemmer such as WordNet. Try skipping the POS filter and see if that makes any difference.
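A minimal sketch of the point above, using a hypothetical tag lexicon and a crude suffix stripper in place of a real trained tagger and the real Porter stemmer:

```python
# Toy POS lexicon mapping surface forms to tags (hypothetical, for
# illustration; real POS taggers use trained models, not lookup tables).
LEXICON = {"running": "VBG", "jumps": "VBZ", "dog": "NN"}

def toy_stem(token):
    """Crude Porter-style suffix stripping (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

tokens = ["running", "jumps", "dog"]

# Tagging BEFORE stemming: every token is a known surface form.
print([(t, LEXICON.get(t)) for t in tokens])
# → [('running', 'VBG'), ('jumps', 'VBZ'), ('dog', 'NN')]

# Stemming first destroys the surface forms the tagger relies on.
stemmed = [toy_stem(t) for t in tokens]
print(stemmed)                                   # → ['runn', 'jump', 'dog']
print([(t, LEXICON.get(t)) for t in stemmed])
# → [('runn', None), ('jump', None), ('dog', 'NN')]
```

In process terms this suggests applying the POS filter before Stem (Porter), or using a dictionary-based stemmer such as WordNet as suggested above.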
Answers
Thanks a lot. Based on your input, I did some research, and it is much clearer to me now.