Transforming output from Process Docs to create a word list/document
Hi there...
We have a challenge to create word/tag clouds from a database system...
Easy, I thought: create a table with the Document ID in the first column, the word in a second column, and the count of that word in the document in a third (we probably wouldn’t use the third column, but just in case). That way we could build a word cloud very quickly, no matter which subset of documents the user selects.
So I have set up the job in RapidMiner, reading the records from the database including only the Document ID and the full text field, and passed them through the Process Documents operator (tokenise, transform case, filter stopwords, filter tokens, stem)... Job done...
Unfortunately no... and here is my problem.
The data that comes out of the Process Documents operator has the Document ID as the first column, but every word that is found then becomes the name of one of the remaining columns... I have looked at Transpose and Pivot, but neither does what I need....
We did think about saving the output as CSV and then doing something outside RapidMiner, but that would make it a manual process rather than something I can automate to run hourly on new records.
Any thoughts or ideas will be most appreciated.
Answers
Did you try the operator "De-Pivot"? This should do the job as far as I can tell from your description.
Cheers,
Ingo
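For anyone following along, here is a minimal sketch of the reshaping being asked for, written in plain Python rather than as a RapidMiner process. The table contents and the names `wide_rows` and `de_pivot` are made up for illustration; the only thing taken from the thread is the layout (one ID column, then one column per word).

```python
# Hypothetical word-vector table in the shape described above:
# one row per document, one column per word holding its count.
wide_rows = [
    {"id": 1, "cloud": 2, "word": 1},
    {"id": 2, "cloud": 0, "word": 3},
]


def de_pivot(rows, id_column="id"):
    """Turn one-column-per-word rows into (id, word, count) triples."""
    triples = []
    for row in rows:
        for name, count in row.items():
            # Skip the ID column itself and words absent from this document.
            if name == id_column or count == 0:
                continue
            triples.append((row[id_column], name, count))
    return triples


long_rows = de_pivot(wide_rows)
for triple in long_rows:
    print(triple)
```

The loop walks over whatever columns happen to be present, so nothing has to be listed by hand when the vocabulary changes between runs.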
Thanks for the reply.. I have had a quick look and it could work if the list of words (and therefore the columns/attributes) stayed the same... but the list of words is already large, and having to set up the attributes in the De-Pivot operator would take a very long time each time the job was run.
I have had a quick look at the Cut Document operator, and it would appear to do what I want, except it does not allow any other metadata to be passed through, so I cannot tell which document the words relate to.
Any suggestions you can make would be really appreciated.
Chris
That could also be a possible approach. Maybe you could multiply the data first, use Cut Document in one path and join both data sets afterwards?
Cheers,
Ingo
Did you ever solve your challenge? I'm trying to do the same thing but without success.
If I use ".*" as Ingo suggests, I get the following error:
'attributes have different value types:no conversion is performed.'
Thanks
Scott
EDIT
I have realised what I was doing wrong now.
Using the following regular expression did the trick:
[^id].*
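One caveat on that pattern, sketched below in Python. `[^id]` is a character class, so it means "one character that is neither i nor d", not "anything except the name id". The attribute names here are invented, and the sketch assumes the pattern has to match the whole attribute name (hence `re.fullmatch`); the negative-lookahead alternative is a suggestion of mine, not something confirmed in this thread.

```python
import re

# Hypothetical attribute names: the id column plus a few tokens.
attributes = ["id", "cloud", "data", "index", "word"]


def select(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Character class: drops every name that starts with 'i' or 'd'.
class_matches = select(r"[^id].*", attributes)
print(class_matches)

# Negative lookahead: drops only the exact name "id".
lookahead_matches = select(r"(?!id$).*", attributes)
print(lookahead_matches)
```

So the original pattern does keep the ID column out, which avoids the value-type error, but it may also silently drop words such as "data" or "index" from the result.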