
"Are clustering and Decision Trees supposed to take hours to process?"

GViasuRaeisaene Member Posts: 1 Learner II
edited June 2019 in Help

Hi,

I'm on a tight schedule and am using RapidMiner for the first time. Agglomerative Clustering has now been running for over 5 hours, and I'm not sure whether I should let it keep running or whether something is wrong and I'm just wasting my time. My example set has 241,762 examples and 25 attributes, most of which are polynominal. I ran into the same problem when trying to create a Decision Tree, and I killed that process after 5 hours.

Thanks,

Geta

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    It's hard to tell without seeing your process and data. Have the polynominal attributes been transformed into numbers via dummy coding? Decision Trees are normally fast, so there must be a problem somewhere.
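
For readers unfamiliar with the term, dummy coding replaces one nominal attribute with one 0/1 column per distinct value. The sketch below is plain Python, not RapidMiner; the function name `dummy_code` and the sample values are hypothetical. It only illustrates why an attribute with many distinct values inflates the number of columns.

```python
def dummy_code(values):
    """Return (column_names, rows): one 0/1 indicator column per distinct value."""
    categories = sorted(set(values))  # one output column per distinct value
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal attribute with three distinct values.
colors = ["red", "green", "red", "blue"]
columns, encoded = dummy_code(colors)

print(columns)
for row in encoded:
    print(row)
```

An attribute with thousands of distinct values would produce thousands of such columns, which is one way a process on this kind of data can become slow.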

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Agglomerative clustering for many examples (rows) is always very slow. The same is true for decision trees with nominal attributes that have a massive number of possible values. I would suggest using the following web site to find out which algorithms are feasible:

    http://mod.rapidminer.com/

    For clustering, I would try "k-Means (fast)", and even that might easily take some time. For classification, I would start with Naive Bayes or k-NN, which in general are pretty fast algorithms.


    Hope this helps,

    Ingo
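
A rough back-of-the-envelope calculation suggests why agglomerative clustering struggles at this size. The sketch below is plain Python and rests on two assumptions that are not from the thread: that the algorithm keeps one distance per pair of examples, and that each distance takes 8 bytes. RapidMiner's actual implementation may store things differently; the helper names are hypothetical.

```python
def pairwise_distance_count(n_examples):
    """Number of distinct example pairs: n * (n - 1) / 2."""
    return n_examples * (n_examples - 1) // 2


def distance_storage_gib(n_examples, bytes_per_distance=8):
    """Memory for one distance per pair, in GiB (assumes 8-byte values)."""
    return pairwise_distance_count(n_examples) * bytes_per_distance / 2**30


n = 241_762  # number of examples reported in the question
pairs = pairwise_distance_count(n)
gib = distance_storage_gib(n)

print(f"{pairs:,} pairwise distances")
print(f"about {gib:.0f} GiB if every distance is stored")
```

Under these assumptions the data set yields about 29 billion pairs, which is far beyond what a typical desktop machine can hold in memory or process quickly.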

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    In general, I would be wary of using nominal attributes that have a high number of possible values in a predictive model. These attributes usually do not generalize well, because the patterns in the training data are too specific and simply overfit to the training sample. You might want to consider some kind of feature engineering to reduce the number of possible values by aggregating or combining values in a sensible way (e.g., 5-digit zip code to region, IP address to country, name to gender, etc.).

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
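
The kind of value reduction described above could be sketched as follows. This is plain Python, not a RapidMiner process; the helper names, the `OTHER` label, the region table, and the sample data are all hypothetical and only illustrate two common approaches: mapping values to a coarser group, and merging rare values into one bucket.

```python
from collections import Counter

# Hypothetical lookup from the first zip digit to a made-up region label.
ZIP_PREFIX_TO_REGION = {"0": "region_A", "1": "region_A", "9": "region_B"}


def zip_to_region(zip_code, unknown="region_unknown"):
    """Map a zip code string to a coarse region via its first digit."""
    return ZIP_PREFIX_TO_REGION.get(zip_code[:1], unknown)


def collapse_rare_values(values, min_count, other_label="OTHER"):
    """Replace values that occur fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


zips = ["02139", "10001", "94105", "60601"]
regions = [zip_to_region(z) for z in zips]

browsers = ["a", "a", "a", "b", "b", "c"]
reduced = collapse_rare_values(browsers, min_count=2)

print(regions)
print(reduced)
```

Either approach lowers the number of distinct values a tree learner has to consider per attribute; which grouping is sensible depends on the data and the prediction target.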