
Compare clustering performance

ahootanha Member Posts: 69 Learner III
edited December 2018 in Help

Hello
How can I compare two clustering algorithms, for example k-means and DBSCAN, and say which one is better on a given dataset? What criteria should I use?

Answers

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @ahootanha - there are several operators available to evaluate cluster performance:

     

    [Image: Screen Shot 2018-04-06 at 9.15.11 AM.png]

     

    And if you go to any of these operators, there are tutorials on how to use them:

     

    [Image: Screen Shot 2018-04-06 at 9.15.41 AM.png]
    [Image: Screen Shot 2018-04-06 at 9.15.54 AM.png]

     

    Scott

  • ahootanha Member Posts: 69 Learner III

    Hello, thank you very much for your guidance.
    Yes, I know about these operators. But I do not know how, or by what criteria, to compare the two clustering methods k-means and DBSCAN and say which one is better.
    Thank you.

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    What @sgenzer is suggesting is that there are multiple ways of comparing different clusterings, and there isn't one single definition of which one is "better".  This is even more true for clustering than for predictive models, because clustering is generally an unsupervised approach, so you don't know in advance what the outcome should look like.  If you do any general reading about clustering performance, you will see that there is a lot of discussion in this field about what constitutes the "best" clustering solution for any given dataset and clustering method, and there is no universal agreement.  So it depends on your use case and the goals of your project: what are you trying to accomplish with the clustering?  For example, is it better if the observations in each cluster are more like each other, or is it better to have fewer clusters?  No one on the forum can answer those questions for you; we can only point you to the tools in RapidMiner that will help you understand and evaluate your clusters using a number of widely used methods.
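
One widely used criterion of the "observations in each cluster are more like each other" kind is the silhouette coefficient. The following is only a minimal Python sketch of that idea, not a RapidMiner process: the points and both label lists are made-up toy values standing in for the output of a k-means run and a DBSCAN run, and treating the label -1 as DBSCAN noise is an assumption of this sketch.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; points labelled -1 (noise) are skipped."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            # (the distance from p to itself is 0, so it does not affect the sum)
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance to the members of the nearest other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items()
                if other_lab != lab
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two tight groups plus one point in between.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (6, 6)]
kmeans_labels = [0, 0, 0, 1, 1, 1, 1]   # every point is forced into a cluster
dbscan_labels = [0, 0, 0, 1, 1, 1, -1]  # the in-between point is marked as noise

score_kmeans = silhouette_score(points, kmeans_labels)
score_dbscan = silhouette_score(points, dbscan_labels)
print(f"k-means silhouette: {score_kmeans:.3f}")
print(f"DBSCAN silhouette:  {score_dbscan:.3f}")
```

Values closer to 1 mean tighter, better separated clusters. Note that skipping the noise points tends to flatter DBSCAN, so the share of points marked as noise should be reported next to the score; whether that trade-off is acceptable is again a question about the goal of the project.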

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • ahootanha Member Posts: 69 Learner III

    Hi

    How, and according to what criteria, can I tell which one performs best on my data?

     

  • kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @ahootanha

     

    I will try to explain further what previous commenters have pointed out. 

    A clustering result is subjective in the sense that you need to know what result and what kind of cluster separation you expect, and this depends entirely on the domain and the type of dataset.

     

    Have a look at the example plots below, where I performed clustering on the same dataset but with different numbers of clusters (k=2, k=3 and k=4):

     

    [Image: 2 clusters.png]

    [Image: 3 clusters.png]

    [Image: 4 clusters.png]

     

    Technically, all three results are valid, as the data points are pretty well separated into clusters. Looking at these plots alone, you cannot say that one of them is 'better' than the others. You also need to understand what exactly this data represents and how exactly you want to cluster it, given its nature and domain.

     

    But once you know that this example is the Iris dataset, which is known beforehand to contain 3 different species, the right number of clusters is 3. At the same time, clustering with only 2 clusters also makes sense, though it reveals only that 1 group of species is significantly different from the rest. What it does not reveal is the further differences within the second group.

     

    That said, you really need to formulate the business (or scientific, or whatever else) problem before you do clustering, and interpret the result with this particular question in mind.
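
When true groups are known in advance, as with the species in Iris, a clustering can also be scored against them. Below is a minimal Python sketch using the Rand index (the fraction of point pairs on which two labelings agree). The nine labels and both cluster assignments are hypothetical stand-ins, not the results behind the plots above.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs that both labelings treat the same way
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)

# Hypothetical toy labels: three examples per species.
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # the two similar species end up merged
k3 = [0, 0, 0, 1, 1, 2, 2, 2, 2]  # three clusters with one misplaced point

ri_k2 = rand_index(species, k2)
ri_k3 = rand_index(species, k3)
print(f"Rand index, k=2: {ri_k2:.3f}")
print(f"Rand index, k=3: {ri_k3:.3f}")
```

In this toy case the k=2 result still agrees with the species on most pairs, because it separates one group cleanly; it only loses on the pairs that span the two merged species. This kind of score is only available when reference labels exist, which is usually not the case in real clustering tasks.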

  • jabra Member Posts: 20 Learner III

    Hello
    Is it possible to do this kind of clustering on text?
    And could you share a screenshot of the process with the operators you used?
    How can I use k-means with Map Clustering on Labels?

  • kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @jabra

     

    Sure, it is possible; however, I have never accomplished this task myself. Still, you can find quite a few posts in the community about text clustering: https://community.rapidminer.com/t5/forums/searchpage/tab/message?advanced=false&allow_punctuation=false&q=text%20clustering 

  • jabra Member Posts: 20 Learner III

    Thanks a lot
    I searched extensively but did not find an answer. Can you help me?

     

  • kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @jabra

     

    I might be able to come up with some ideas if you can share your dataset and clearly describe the goal you want to achieve by clustering it.

  • pschlunder Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research

    Hi,

    Here is another view on performance metrics for clustering. These metrics are often based on descriptive statistics, or are simply a mapping from the data in a cluster to a number (data-based/inherent metrics). From the final number for a single cluster alone you can rarely decide whether something is good or not. It is often the context that provides insight, e.g. comparing those numbers between different clustering techniques, settings or clusters.

     

    A simple example would be the shortest distance between cluster borders. Knowing only that two clusters are a certain distance apart, it would be hard to decide whether they are sufficiently separated, because the distance depends on the metric of the given attribute space. But knowing that other clusters are further apart would help you understand that the clustering task might be easier there, due to the bigger gaps between clusters.
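
The border-distance idea can be sketched in a few lines of Python. This is an illustration only, assuming Euclidean distance and three made-up clusters; the name `min_gap` and the rescaling factor of 1000 are hypothetical choices for the example.

```python
import math
from itertools import combinations

def min_gap(cluster_a, cluster_b):
    """Shortest distance between any point of one cluster and any point of the other."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical toy clusters in a two-attribute space.
clusters = {
    "A": [(0.0, 0.0), (1.0, 0.5)],
    "B": [(3.0, 0.0), (4.0, 1.0)],
    "C": [(0.0, 9.0), (1.0, 10.0)],
}

# The gaps only become informative when compared with each other.
gaps = {
    (a, b): min_gap(clusters[a], clusters[b])
    for a, b in combinations(sorted(clusters), 2)
}
for pair, gap in gaps.items():
    print(pair, round(gap, 2))

# The same data with the first attribute in different units (factor 1000):
# the A-B gap changes although the clustering itself is unchanged.
rescaled = {
    name: [(x * 1000, y) for x, y in pts] for name, pts in clusters.items()
}
gap_ab_rescaled = min_gap(rescaled["A"], rescaled["B"])
print("A-B after rescaling:", round(gap_ab_rescaled, 2))
```

On its own the A-B gap says little, but next to the A-C and B-C gaps it shows that A and B are the pair that is hardest to tell apart. The rescaled value shows why the raw number cannot be judged without knowing the attribute space.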

     

    Regards,

    Philipp
