
SOLVED: How to efficiently handle a large number of classes in models

julio Member Posts: 17 Contributor II
edited November 2018 in Help
Hi,

I have been using RapidMiner and Analytics for quite some time, and the product is really great. Congratulations. After a lot of ETL, I am starting to use models.

My first serious model consisted of recreating a deterministic model for examples with a high number of classes (thousands) and millions of examples. Performance and efficiency are key for the implementation.
Therefore, what I did was create a model using a tree algorithm, setting my own weights. After fine-tuning the default parameters, building the model worked really well (as long as you don't want to display the model, as rendering the tree takes forever and then some). Still, I could translate it into rules and verify that the result was correct.

The problem came when applying the model using the "Apply Model" operator. This operator also creates a confidence (probability) for each class, which results in an explosion of data in my case. I admit that so many classes are probably not that common, but I cannot imagine my case is that unusual, so I suppose there must be some way to handle this.

I actually recreated the model with "old-fashioned" programming using RapidMiner (it's a kind of B-tree look-up mechanism), and I could get up to 50-70 "predictions" (or mappings) per second on my quad-core processor, standalone. I would expect the model mechanism to give me at least a 5x improvement over that.

Thankful for any insight...

Julio

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Julio,

    does the "data explosion" impose any problems, e.g. in terms of memory consumption? If not, you can simply remove the confidence attributes by using a Select Attributes with the following settings:
    regular expression = confidence, include special attributes, invert selection. This will clean your dataset.
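
    For reference, here is a minimal sketch of that Select Attributes step in process XML (a sketch only; compatibility and layout attributes are omitted, and depending on your version the plain "confidence" expression may need a trailing .* to match full names like confidence(classA)):

    <!-- drop the confidence(...) special attributes created by Apply Model -->
    <operator activated="true" class="select_attributes" name="Select Attributes">
      <parameter key="attribute_filter_type" value="regular_expression"/>
      <parameter key="regular_expression" value="confidence.*"/>
      <parameter key="invert_selection" value="true"/>
      <parameter key="include_special_attributes" value="true"/>
    </operator>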

    As a further remark, let me add that to verify that the decision tree is doing a good job you do not need to manually create rules; you can simply use cross-validation (the X-Validation operator in RapidMiner). If you are not familiar with the concept, a quick Google or Wikipedia search will give you a good overview.
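
    If it helps, a rough skeleton of such a validation in process XML looks like the following (a sketch only; compatibility attributes are omitted and exact port names may differ slightly between versions):

    <!-- X-Validation: train a Decision Tree on k-1 folds, evaluate on the held-out fold -->
    <operator activated="true" class="x_validation" name="Validation">
      <process expanded="true">
        <!-- training subprocess -->
        <operator activated="true" class="decision_tree" name="Decision Tree"/>
        <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
        <connect from_op="Decision Tree" from_port="model" to_port="model"/>
      </process>
      <process expanded="true">
        <!-- testing subprocess -->
        <operator activated="true" class="apply_model" name="Apply Model"/>
        <operator activated="true" class="performance" name="Performance"/>
        <connect from_port="model" to_op="Apply Model" to_port="model"/>
        <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
        <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
        <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
      </process>
    </operator>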

    Best regards,
    Marius
  • julio Member Posts: 17 Contributor II
    Hi Marius,

    Thank you for the answer.

    Indeed, given the relatively large number of classes (thousands), this does become a memory problem in practice. I understand that I could of course throttle the number of entries (which would be hundreds of thousands) going through the model, but... I was wondering if there would be more efficient options... (I do understand the rationale, but it looks like the Apply Model operator's approach with a large number of classes is not "elegant", whatever that word means in data analytics... :-))
    I infer from your answer that there isn't. How about an Apply Model without confidence attributes?

    FYI, the reason I programmed things myself was to see what the performance would be without using the model.
    I will still check performance with the model, but this data explosion is really a deal-breaker... (for this specific context).

    Thank you!

    Julio
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Julio,

    unfortunately there is no way to prevent the models in RapidMiner from creating the confidence attributes, but I get your point that for your specific use case it is not very handy.
    However, I find it hard to believe that you can get acceptable accuracy with one single model for so many classes. Without knowing anything about the underlying concepts in your data it is hard to give any additional help, but maybe it is possible to combine some of the classes to reduce the number of possible outcomes and create a kind of hierarchical model, which first predicts one of the combined classes, and then a second model digs deeper to identify the original classes within that combined class?
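
    To illustrate the first step, the label values could be collapsed into coarser groups with a Map operator before training the first-stage model. A sketch only: the attribute name "label" and the class/group names below are made up, and the second-stage models would then be trained on subsets filtered per group:

    <!-- Map: merge fine-grained label values into coarse groups for the first-stage model -->
    <operator activated="true" class="map" name="Map">
      <parameter key="attribute_filter_type" value="single"/>
      <parameter key="attribute" value="label"/>
      <parameter key="include_special_attributes" value="true"/>
      <list key="value_mappings">
        <!-- hypothetical mappings: replace with your own grouping -->
        <parameter key="class_0001" value="group_A"/>
        <parameter key="class_0002" value="group_A"/>
        <parameter key="class_0003" value="group_B"/>
      </list>
    </operator>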

    Best regards,
    Marius
  • julio Member Posts: 17 Contributor II
    Thanks Marius,

    The point is that I have a set of rules that always determine the correct result 100% of the time (by definition). I also take your point that this is not much of a prediction model...

    Thanks again!

    Julio
