
How to set up model to categorize texts

gstargstar Member Posts: 3 Contributor I
edited November 2018 in Help
Hi folks, being relatively new to RapidMiner, I would like to achieve the following task:

To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.

*the documents are relatively short and contain between 50 and 200 words

So far the text mining part works quite well.
Choosing the right model is the challenging part.
A decision tree produces a plausible model. However, the branches do not expose a simple y/n test (word exists / does not exist). Instead I am only shown statistics for the split decisions, which I cannot use for step 3.  :-[

Thanks for any input!
Gstar
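
For readers following along outside RapidMiner, step 1 of the question (most common words per category) can be sketched in plain Python. This is only an illustration under assumptions: the toy recipes, the tiny stop-word list, and the helper names `tokenize` and `top_words_per_category` are invented here and are not part of the original process.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "and", "with", "in", "then"}

def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def top_words_per_category(docs, n=3):
    """docs is a list of (category, text) pairs.

    Returns {category: [(word, count), ...]} with the n most frequent words.
    """
    counts = {}
    for category, text in docs:
        counts.setdefault(category, Counter()).update(tokenize(text))
    return {category: c.most_common(n) for category, c in counts.items()}

# Toy recipes, invented for illustration only.
docs = [
    ("beef", "Sear the beef, then simmer the beef in stock."),
    ("beef", "Roast beef with stock and onions."),
    ("vegetables", "Steam the carrots and the peas."),
    ("vegetables", "Roast carrots with peas and onions."),
]

top = top_words_per_category(docs, n=2)
for category, words in top.items():
    print(category, words)
```

The per-category counts are what a word-vector step would hand to a learner; the documents in the question are short (50 to 200 words), so raw counts or simple word presence are both plausible inputs.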

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Gstar,

    For text mining, Naive Bayes or a linear SVM usually does a good job.
    Don't forget to optimize the C parameter of the SVM using Optimize Parameters (Grid). A range from 1e-4 to 1 on a logarithmic scale is usually a good starting point. Expand the range if the detected optimum lies near either end of it.

    Best regards,
    Marius
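
The advice on tuning C (search a logarithmic range, widen it if the optimum sits at an edge) can be sketched in plain Python. This is not the Optimize Parameters (Grid) operator itself, just the idea behind it. The function `search_c` and the toy `score` function are hypothetical; in practice `score` would be something like the cross-validated accuracy of the linear SVM for a given C.

```python
import math

def search_c(score, low_exp=-4, high_exp=0, max_expansions=3):
    """Grid-search C over 10**low_exp .. 10**high_exp, one value per decade.

    While the best C sits on an edge of the grid, widen the range by one
    decade on that side (at most max_expansions times) and search again.
    Returns the best C and the final (low_exp, high_exp) range.
    """
    for attempt in range(max_expansions + 1):
        grid = [10.0 ** e for e in range(low_exp, high_exp + 1)]
        best = max(grid, key=score)
        if attempt == max_expansions:
            break                    # expansion budget used up
        if best == grid[0]:
            low_exp -= 1             # optimum at lower edge: extend downwards
        elif best == grid[-1]:
            high_exp += 1            # optimum at upper edge: extend upwards
        else:
            break                    # optimum is interior: done
    return best, (low_exp, high_exp)

# Toy stand-in for a validation score (an assumption): it peaks at C = 10,
# which lies outside the initial range 1e-4 .. 1.
def score(c):
    return -abs(math.log10(c) - 1.0)

best_c, final_range = search_c(score)
print(best_c, final_range)
```

With the toy score, the search starts at 1e-4 to 1, finds the optimum at the upper edge twice, and stops once the best value is no longer on the boundary.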
  • gstargstar Member Posts: 3 Contributor I
    Great. Thanks! I'll try it and report back later!
  • gstargstar Member Posts: 3 Contributor I
    Working with 5 categories, so far I got the best results with a k-NN model using overlap similarity and k=5.
    Naive Bayes performs worse.
    I cannot get the linear SVM to work, since it does not support polynominal labels (i.e. 5 different labels in my case).

    Is there a workaround?
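
For illustration, a k-NN classifier with an overlap-style similarity might look like the sketch below. It assumes the set-based overlap coefficient |A ∩ B| / min(|A|, |B|) on word sets, which may differ from how RapidMiner computes its overlap similarity. The toy documents and the names `overlap` and `knn_predict` are invented here, and k=3 is used only because the toy data set is tiny.

```python
from collections import Counter

def overlap(a, b):
    """Overlap coefficient of two word sets: |a & b| / min(|a|, |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def knn_predict(train, query, k=5):
    """train is a list of (label, word_set) pairs.

    Ranks training documents by similarity to the query and returns the
    majority label among the k most similar ones.
    """
    ranked = sorted(train, key=lambda item: overlap(item[1], query), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training documents, invented for illustration only.
train = [
    ("beef", {"beef", "stock", "onion"}),
    ("beef", {"beef", "roast", "garlic"}),
    ("beef", {"beef", "stew", "carrot"}),
    ("vegetables", {"carrot", "peas", "steam"}),
    ("vegetables", {"spinach", "garlic", "saute"}),
]

query = {"beef", "stock", "simmer", "bones"}   # e.g. a beef stock recipe
predicted = knn_predict(train, query, k=3)
print(predicted)
```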
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    The operator Polynominal by Binominal Classification is your friend in this case :)

    Best regards,
    Marius
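
The general idea of wrapping a two-class learner so it can handle several labels can be sketched as a one-vs-rest scheme: train one binary model per category ("this category" vs. "everything else") and pick the category whose model scores highest. Whether the operator is configured this way in a given process is an assumption. The small perceptron below is only a stand-in for a binary learner such as a linear SVM, and all names and toy documents are invented for the example.

```python
class Perceptron:
    """Tiny binary linear classifier on word sets (stand-in for a 2-class learner)."""

    def __init__(self, epochs=10):
        self.epochs = epochs
        self.weights = {}
        self.bias = 0.0

    def score(self, words):
        return self.bias + sum(self.weights.get(w, 0.0) for w in words)

    def fit(self, docs, targets):
        """targets holds +1 / -1; weights are updated on every misclassified doc."""
        for _ in range(self.epochs):
            for words, y in zip(docs, targets):
                if y * self.score(words) <= 0:
                    for w in words:
                        self.weights[w] = self.weights.get(w, 0.0) + y
                    self.bias += y
        return self

def fit_one_vs_rest(docs, labels, make_learner=Perceptron):
    """Train one binary model per class: that class (+1) against all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = make_learner().fit(docs, targets)
    return models

def predict_one_vs_rest(models, words):
    """Return the class whose binary model gives the highest score."""
    return max(models, key=lambda cls: models[cls].score(words))

# Toy documents as word sets, invented for illustration only.
docs = [
    {"beef", "stock", "onion"}, {"beef", "roast", "garlic"},
    {"carrot", "peas", "steam"}, {"spinach", "garlic", "saute"},
    {"flour", "sugar", "butter"}, {"sugar", "cream", "vanilla"},
]
labels = ["beef", "beef", "vegetables", "vegetables", "dessert", "dessert"]

models = fit_one_vs_rest(docs, labels)
prediction = predict_one_vs_rest(models, {"beef", "stock", "bones"})
print(prediction)
```

One-vs-rest needs only as many binary models as there are categories (5 in the case above), which keeps training cheap for short documents.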