Keep unique ID when tokenizing
When reading a CSV file with two columns, ID and MESSAGE, is it possible to keep the ID field when using the operator Process Documents from Data?
I use this operator to tokenize messages, but I want to keep the relation between the words and the message through the unique ID column.
So when tokenizing the following sentence:
ID sentence
1 Rapidminer rocks the world!
I want the result:
ID word
1 Rapidminer
1 rocks
1 the
1 world
Answers
Can you post your process and a small fraction of your data to clarify your problem?
Thanks for your reply.
See links below for screenshots:
http://postimage.org/image/4paepjga3/
http://postimage.org/image/s5eadmbh1/
http://postimage.org/image/8c3c90ahz/
http://postimage.org/image/pvvx4p4br/
Let's say I have 80 000 messages from different users, posted over the course of one year. Now I want to analyze which subjects were hot in a certain time frame among a selected set of users of a certain age. I want to do this with another data visualisation tool in which I can make selections on the fly. To be able to do this, I need the relation between the message, the user and the time it was posted.
Now when I tokenize all 80 000 messages, I get one set with the most frequent words, but nothing links each word back to the message it came from. There is just the total count. Is there some way to keep the relation with the message?
Please do not attach your process and data as screenshots. Screenshots do not help us reproduce your problem. Please read this posting which explains how to provide a process as XML. You can use the code tags as well to attach a small fraction of sample data (this can be a part of your real data or some artificial data with the same problem) which does not work as expected.
Regarding your question: The "Process Documents from Data" operator yields a word vector where each row represents a message, and every word with a value greater than zero is contained in that message. And as I said, usually other rows will be retained. If you post your process, this will clarify a lot of things, I think.
Marcin
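To make the two shapes discussed above concrete, here is a minimal sketch in plain Python. It is not a RapidMiner process and not what the operator does internally. The sample rows, the function names and the split-on-non-letters rule are all assumptions made for illustration. It shows a per-message word vector keyed by ID, and the long (ID, word) layout asked for in the question.

```python
import re
from collections import Counter

# Hypothetical sample data mirroring the ID / MESSAGE columns from the question.
rows = [
    {"ID": 1, "MESSAGE": "Rapidminer rocks the world!"},
    {"ID": 2, "MESSAGE": "The world rocks"},
]

def tokenize(text):
    """Split on runs of non-letter characters and drop empty pieces (assumed rule)."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

def word_vectors(rows):
    """One entry per message: the ID plus a word -> count mapping."""
    return [(row["ID"], Counter(tokenize(row["MESSAGE"]))) for row in rows]

def long_format(rows):
    """One (ID, word) pair per token, so each word stays linked to its message."""
    return [(row["ID"], tok) for row in rows for tok in tokenize(row["MESSAGE"])]

# Print the long layout: one line per word, prefixed by the message ID.
for message_id, word in long_format(rows):
    print(message_id, word)
```

The long layout is the one that joins back to user and timestamp columns in another tool, since every word row carries the message ID.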