The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
[SOLVED] Text Processing - Tokenize: keep word order
CharlieFirpo
Member Posts: 48 Contributor II
Dear All!
Can anybody help me to do a text tokenization in a way that remains the original word order?
I have a sample text like: "delta gamma alpha beta" I use a Process Documents operator and a Tokenize operator in it. I create a word vector that will be an example set after a WordList to Data operator. And unfortunately this result is an alphabetically ordered list, so 'alpha; beta; gamma; delta' [first, second, third, fourth rows]. I want the original word order, so an example set, where the first example is 'delta', second is 'gamma', third is 'alpha', fourth is 'beta'. Without the WordList to Data operator, I have a WordList that is also an alphabetically ordered list.
Of course this can be solved with a Loop operator in a difficult way, but this is not powerful.
So how can I tokenize in a way that remains the original word order?
Thank you!!
Can anybody help me to do a text tokenization in a way that remains the original word order?
I have a sample text like: "delta gamma alpha beta" I use a Process Documents operator and a Tokenize operator in it. I create a word vector that will be an example set after a WordList to Data operator. And unfortunately this result is an alphabetically ordered list, so 'alpha; beta; gamma; delta' [first, second, third, fourth rows]. I want the original word order, so an example set, where the first example is 'delta', second is 'gamma', third is 'alpha', fourth is 'beta'. Without the WordList to Data operator, I have a WordList that is also an alphabetically ordered list.
Of course this can be solved with a Loop operator in a difficult way, but this is not powerful.
So how can I tokenize in a way that remains the original word order?
Thank you!!
Tagged:
0
Answers
How about the following regards
Andrew
It works perfectly! I changed the 'mode' parameter at Tokenize operator in Cut Document to 'specify characters = . ,;:' in order to handle numbers as well at the input text.
Nice day!