"Increasing text categorization performance through dedicated wordlists"
I have been playing with text categorization over the last few months and I now have a question for which I could not find an answer here on the forums or somewhere else.
My text categorization models have an accuracy of around 62% (SVM, with SVD for dimensionality reduction)
I want to try to improve this by "helping" the learner a little bit. For a category 'Product related' I know all possible products (something RapidMiner - of course - does not know). Another example would be a list of swear words for tagging cases with a category 'Flame'.
Is it possible to help the learner by connecting or relating wordlists to certain categories?
Thanks for your help!
Answers
If you really want to create rules, you could use Process Documents with binary term occurrences, then use Generate Attributes and Filter Examples to assign labels manually, and apply the model only to the remaining documents that are not covered by the manual rules.
Best regards,
Marius
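For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the rule-based split described above. It is not the RapidMiner process itself: the wordlists, the example documents, and the function names (`tokenize`, `rule_label`, `split_by_rules`) are all hypothetical, and letting the first matching wordlist win is an assumption made only to keep the sketch short.

```python
import re

# Hypothetical wordlists; the real product and swear-word lists
# would come from your own data.
WORDLISTS = {
    "Product related": {"widgetpro", "gizmo3000"},
    "Flame": {"idiot", "stupid"},
}

def tokenize(text):
    """Return the set of distinct lowercase tokens (binary term occurrences)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rule_label(text, wordlists=WORDLISTS):
    """Return the first category whose wordlist shares a token with the text, else None."""
    tokens = tokenize(text)
    for category, words in wordlists.items():
        if tokens & words:
            return category
    return None

def split_by_rules(documents, wordlists=WORDLISTS):
    """Split documents into (text, label) pairs covered by a rule and uncovered texts."""
    labeled, remaining = [], []
    for doc in documents:
        label = rule_label(doc, wordlists)
        if label is None:
            remaining.append(doc)  # left for the trained model
        else:
            labeled.append((doc, label))
    return labeled, remaining

docs = [
    "My WidgetPro stopped working after the update",
    "Whoever wrote this is an idiot",
    "How do I reset my password?",
]
labeled, remaining = split_by_rules(docs)
```

Only the documents in `remaining` would then be passed to the trained model. Because the first matching wordlist wins, the order of the entries in `WORDLISTS` decides what happens to a document that matches several lists.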
Thank you, I had not thought about that approach. But does that mean it is not possible to help the model by giving it a list of words that are strongly related to a certain label?
With manually assigned labels, I think I will run into problems with cases that contain words from the wordlists of multiple labels.
How would I deal with this?
What you could do, however, is generate a new attribute that contains the result of the "classification" by keywords, as described in my post above, and use that attribute in addition to the normal word vector when creating the SVM model.
Best regards,
Marius
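The idea of feeding the keyword result to the learner as an extra attribute can be sketched in Python roughly as follows. This is an illustration under assumptions, not the RapidMiner process: the wordlists, documents, and function names are hypothetical, and instead of a single keyword attribute the sketch uses one binary flag per wordlist, a variation chosen so that a document matching several lists keeps all of its matches.

```python
import re

# Hypothetical wordlists, one per category of interest.
WORDLISTS = {
    "Product related": {"widgetpro", "gizmo3000"},
    "Flame": {"idiot", "stupid"},
}

def tokenize(text):
    """Return the lowercase word tokens of the text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Return the sorted list of all distinct tokens in the corpus."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def keyword_features(text, wordlists=WORDLISTS):
    """Return one 0/1 flag per wordlist: 1 if the text contains any word from it."""
    tokens = set(tokenize(text))
    return [1 if tokens & words else 0 for words in wordlists.values()]

def to_feature_row(text, vocabulary, wordlists=WORDLISTS):
    """Return the binary word vector followed by the keyword flags."""
    tokens = set(tokenize(text))
    word_vector = [1 if term in tokens else 0 for term in vocabulary]
    return word_vector + keyword_features(text, wordlists)

docs = [
    "This stupid Gizmo3000 broke again",
    "How do I reset my password?",
]
vocabulary = build_vocabulary(docs)
rows = [to_feature_row(doc, vocabulary) for doc in docs]
```

Each row could then be passed to the learner in place of the plain word vector. The first document sets both keyword flags, so the learner, rather than a hard rule, decides how to weigh the conflicting hints.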