Creating model to categorize data

MarlaBot Employee-RapidMiner, Member Posts: 57 Community Manager
edited February 2019 in Help
A RapidMiner user wants to know the answer to this question: "I have a list of about 120 values that serve as categories. I have to be able to predict what category a value belongs to based on its other attributes. The values that I am training on are each associated with one of these categories. I need to create a model that will take the combination of values from the other columns and predict what category it belongs in. I have tried to use a decision tree and it does not seem to be doing very well. There are too many categories and it keeps making poor predictions. Any suggestions? Thank you."

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,
    Is there any way to use a taxonomy across the 120 classes?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    There are probably too many categories and not enough cases in many of them for the algorithm to detect patterns all at once.  You have a couple of options:
    • Create groupings of these categories (this is the taxonomy that Martin mentioned above) so you end up with a much smaller number of super-categories and try to build a model to predict those.  Ideally you would have pretty robust counts in each of the super-categories and not too many of them (e.g., 12 would be much better than 120!).
    • Find the dominant categories (once again by count) and create a series of "one vs. all others" models, one binary model per dominant category. This requires building multiple models but gives you more control over the specific categories selected.
    • Or you could do a hybrid of the two methods above.
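
For anyone who wants to try these options outside the RapidMiner GUI, the label preparation could be sketched roughly as follows in plain Python. This is only an illustration under assumptions: the function names, the `taxonomy` mapping, and the toy labels are all hypothetical and not part of RapidMiner or the original question, and the choice of learner is left open.

```python
from collections import Counter

def to_super_categories(labels, taxonomy, default="other"):
    """Option 1: map each fine-grained label to its super-category."""
    return [taxonomy.get(label, default) for label in labels]

def dominant_categories(labels, min_count):
    """Return the labels seen at least min_count times, most frequent first."""
    counts = Counter(labels)
    return [label for label, n in counts.most_common() if n >= min_count]

def one_vs_rest_targets(labels, selected):
    """Option 2: one binary target column (1 = this category) per selected category."""
    return {cat: [1 if label == cat else 0 for label in labels] for cat in selected}

def hybrid_labels(labels, selected, taxonomy, default="other"):
    """Option 3: keep dominant labels as-is, roll the rest up to super-categories."""
    keep = set(selected)
    return [label if label in keep else taxonomy.get(label, default)
            for label in labels]

# Hypothetical toy data standing in for the ~120 real categories.
labels = ["apple", "pear", "apple", "carrot", "apple", "leek", "pear"]
taxonomy = {"apple": "fruit", "pear": "fruit",
            "carrot": "vegetable", "leek": "vegetable"}

grouped = to_super_categories(labels, taxonomy)
dominant = dominant_categories(labels, min_count=2)
targets = one_vs_rest_targets(labels, dominant)
hybrid = hybrid_labels(labels, dominant, taxonomy)
```

Each result is just a new label column: `grouped` and `hybrid` can be used as the target of a single multi-class model, while every entry in `targets` is the target for one separate binary model. The `min_count` threshold is an assumption you would tune so that each remaining class has enough cases.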
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts