tokenize
Hi
I am using RapidMiner to try to tokenize a column in a database which contains text data.
I want to keep the ID with the Text column so instead of:
ID TEXT
12 I love data mining
it would appear as
ID TOKEN_TEXT
12 I
12 love
12 data
12 mining
Can I do this with 'Process Documents from Data'? Its output is either the word list (with no ID, even though I have set the role of the ID column to ID) or an ExampleSet containing the ID, but I need both together!
Is there a way of doing this?
Note: the reason for doing this is so that I can then join the tokens to a list of words that tells me the sentiment (if any) related to each word.
Thanks in advance
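To make the end goal concrete, the join described in the note could look roughly like the sketch below, written in plain Python rather than RapidMiner. The `lexicon` contents and the `join_sentiment` helper are hypothetical and only illustrate the idea; it assumes the text has already been broken into one (ID, token) row per word.

```python
def join_sentiment(token_rows, lexicon):
    """Left-join (id, token) rows to a word -> sentiment mapping."""
    joined = []
    for row_id, token in token_rows:
        # Look up the lower-cased token; None means the word has no listed sentiment.
        joined.append((row_id, token, lexicon.get(token.lower())))
    return joined


# Hypothetical sentiment word list, for illustration only.
lexicon = {"love": "positive", "hate": "negative"}

# One row per token, each still carrying its ID.
token_rows = [(12, "I"), (12, "love"), (12, "data"), (12, "mining")]

for row in join_sentiment(token_rows, lexicon):
    print(row)
```

Words that are not in the list are kept with an empty sentiment, which mirrors a left join.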
Answers
Did you consider using the "Split" operator? Since your task does not seem to involve text processing, I would not use the tokenize approach: word lists and word vectors serve a different purpose than simply dividing text into words.
If splitting is not enough, you can add the "De-Pivot" operator to create a table similar to the one you posted as an example. Here is a little process to illustrate the use of both operators:
Regards
Matthias
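For readers without RapidMiner at hand, the split-then-de-pivot idea can be sketched in plain Python. This is not the process referred to above, only an illustration of the same two steps; the `split_and_depivot` helper and the whitespace pattern are assumptions.

```python
import re


def split_and_depivot(rows, pattern=r"\s+"):
    """Turn (id, text) rows into one (id, token) row per word."""
    long_rows = []
    for row_id, text in rows:
        # "Split" step: break the text into pieces on the pattern,
        # dropping empty pieces from blank input.
        tokens = [t for t in re.split(pattern, text.strip()) if t]
        # "De-Pivot" step: emit one row per piece, repeating the ID.
        for token in tokens:
            long_rows.append((row_id, token))
    return long_rows


rows = [(12, "I love data mining")]
for row_id, token in split_and_depivot(rows):
    print(row_id, token)
```

For the example row this prints the four lines `12 I`, `12 love`, `12 data`, `12 mining`, which is the layout asked for in the question.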
The Split operator correctly separates the string, as I expected.
However, I have large volumes of source data and this takes a long time to run, so I may look at moving this functionality back into the database (Oracle).
Many thanks
Brian