WordList (Process Documents from Data): word count
Using the Process Documents from Data operator, we get as the WordList a table listing each word with its Total Occurrences and Document Occurrences.
However, in the sample process "Applying a Model to categorize Documents" (under RM Academy), we also get additional columns for the classes/categories. In that process there are 2 columns, named unknown and food/beverage/hospitality.
When you use WordList to Data, the columns are labelled inclass (unknown) etc.
I get all zero values in both columns, no matter which vector creation method I use (I use Term Occurrences). What should be changed to get the words counted for both classes?
Thank you.
Answers
I have been running the process you were referring to (assuming this is the one) and haven't been able to reproduce the issue. Can you share your process or send a screenshot? See details on how to do this here: https://community.rapidminer.com/discussion/37047
Did you watch the related video? https://academy.rapidminer.com/learn/video/applying-a-model-to-categorize-documents
Thanks, Knut
I finally found the time to look into it. The "0" values are caused by the "Extract Content" operator inside "Process Documents from Data". Go into the Parameters of that operator and untick the first entry, called "extract content". If you do that and run the process again, you will see that the columns get populated and show the total occurrences for each of the two classes ("unknown" and "food/beverage..."). That output could be used, for example, to generate a custom pruning mask to reduce the data of the class which is not of interest, but I guess there are also other creative options.
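To make it clearer what those per-class columns hold and how a pruning mask could be derived from them, here is a minimal sketch in plain Python. It is not RapidMiner code and does not use any RapidMiner API: the function names `class_word_counts` and `pruning_mask`, the `min_share` threshold, the whitespace tokenizer and the three toy documents are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def class_word_counts(documents):
    """Count term occurrences per class.

    documents: iterable of (text, label) pairs.
    Returns {word: Counter({label: occurrences})}.
    """
    counts = defaultdict(Counter)
    for text, label in documents:
        # Naive whitespace tokenization (assumption, for illustration only).
        for token in text.lower().split():
            counts[token][label] += 1
    return counts


def pruning_mask(counts, keep_class, min_share=0.5):
    """Return the words whose share of occurrences in keep_class
    is at least min_share (hypothetical pruning rule)."""
    mask = set()
    for word, per_class in counts.items():
        total = sum(per_class.values())
        if per_class[keep_class] / total >= min_share:
            mask.add(word)
    return mask


# Toy documents (invented), labelled with the two classes from the sample process.
docs = [
    ("hotel restaurant menu", "food/beverage/hospitality"),
    ("restaurant wine menu", "food/beverage/hospitality"),
    ("quarterly report menu", "unknown"),
]

counts = class_word_counts(docs)
mask = pruning_mask(counts, "food/beverage/hospitality", min_share=0.6)
```

Here `counts["menu"]` plays the role of one WordList row with its two class columns, and `mask` is the set of words you would keep if only the food/beverage/hospitality class is of interest.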
You are now probably wondering why the Extract Content operator is causing the empty values, and my answer is: I don't know. Without having more details, I'd say it feels like a bug to me, so I will send this to our developers. Hope this helps!
Cheers, Knut