
How should I decide the best k (3, 4, or 5)? How can I recognize the best Davies-Bouldin value?

GeorgeOik1999 Member Posts: 10 Contributor II
Hi all, I'm working on a project for a Business Intelligence course. I chose a clustering problem using parameter optimization. The problem I face is that my results are not very interpretable. Every model I tried brought me nothing but chaos on the scatter plots and the clustering visualization. It might be the data quality or the preparation I did. The predictions I want to create are consumer preferences based on GameType (genre), platforms, sales, and a range of game release years. Here is a sample of my data.


Any advice about my data preparation (so that it produces interesting results) would be appreciated.
Thanks in advance.

Best Answers

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited January 2021 Solution Accepted
    It looks like your data is a mix of nominal and numerical values. Before you apply k-means clustering you need to decide whether to convert the nominal values to numerical, in which case you can rely on numerical measures such as Euclidean or Cosine distance, or to keep them as they are and use Mixed measures (which are more primitive). Also eliminate missing values.
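    As a rough sketch of that preparation (in Python/pandas rather than RapidMiner, with a made-up file name and column names, purely to illustrate the idea; in RapidMiner the equivalent operators are Replace Missing Values and Nominal to Numerical):

    ```python
    # Illustrative only -- "games.csv" and the column names are assumptions.
    import pandas as pd

    df = pd.read_csv("games.csv")

    # Eliminate rows with missing values
    df = df.dropna()

    # Turn nominal attributes into numerical dummy variables so that
    # Euclidean or Cosine measures can be used instead of Mixed measures
    df_num = pd.get_dummies(df, columns=["Genre", "Platform"])
    ```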

    Then you can run the optimisation loop, varying k against some cluster performance measure such as Davies-Bouldin. An important principle when running the optimisation is to ensure that any variation within the loop can be traced back to the change in k and not to some other random effect. If random effects are at play you will often get a zig-zag performance plot. k-means starts by positioning the centroids randomly, and this can greatly influence the final clustering and the final performance measure.

    One way to eliminate this is to increase the "max runs" parameter of k-means, so that more possible starting points are tried. Another is to set the "local random seed" to a specific value so that the k-means initialisation always starts in the same way, and the only aspect changing is k. It is best to apply both. You may also need to play with "max optimisation steps", as this parameter can stop the k-means optimisation prematurely.
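    Put together, a minimal sketch of such an optimisation loop (here in Python/scikit-learn rather than a RapidMiner process, only to show the principle: a fixed seed and many runs so that k is the only thing that varies):

    ```python
    # Assumes df_num is the all-numerical data prepared earlier
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    X = df_num.to_numpy()
    db_scores = {}
    for k in range(2, 21):
        km = KMeans(n_clusters=k,
                    n_init=25,          # like "max runs": many starting centroids
                    max_iter=300,       # like "max optimisation steps"
                    random_state=42)    # like "local random seed"
        labels = km.fit_predict(X)
        db_scores[k] = davies_bouldin_score(X, labels)  # closer to zero is better

    print(db_scores, min(db_scores, key=db_scores.get))
    ```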

    When you rely on the Davies-Bouldin cluster distance performance, it is always the DB value closest to zero that is optimum (depending on the performance settings the chart may be flipped up or down, but it is always the values closest to zero that matter). At the same time, it makes no sense to select a deep DB hole surrounded by peaks of bad performance: when you apply the model to new data, you may end up with very bad clustering for that data. So a good heuristic is to look for an area of relative stability that is closest to zero, pick its mid-point, and use that as your k-DB pair.

    However, DB clustering performance works very well on "smooth" numerical attributes (no steps, as in integer or binomial dummy variables). It may also be unstable when the data is not "convex" (no multi-dimensional valley to slide into). So it needs testing: if your DB performance chart is all zig-zags up and down, or always descends towards zero with no well-defined bend, you will not be happy with Davies-Bouldin.

    What are the options then? Always keep in mind the objective of your clustering. If you are looking for the best segmentation to support a marketing campaign, your company may not be able to deal with or afford to process 100 segments, perhaps only 5. If you are looking for clustering to reduce the dimensionality of your data, then several hundred clusters will be just fine. This means it is not the overall optimum k you'd be looking for, but the best k in a range, say up to 20. In the latter case the selection of the DB value is simpler.

    If all fails (e.g. your data is "concave"), collect the Item Distribution performance SumOfSquares (within cluster) and rely on the informal "elbow" method: look at k vs WSS and find the point after which an increase in the complexity of your clustering system (as defined by k) no longer yields a significant improvement in performance (as measured by WSS). If there is a sharp elbow point you are a winner; if the elbow is not sharp, you may need to argue your decision.
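    A minimal sketch of that elbow check (again Python/scikit-learn, purely illustrative, reusing X from the earlier sketch): compute the within-cluster sum of squares for each k and look for the bend in the curve.

    ```python
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(2, 21)
    wss = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=25, random_state=42).fit(X)
        wss.append(km.inertia_)          # within-cluster sum of squares

    plt.plot(list(ks), wss, marker="o")
    plt.xlabel("k")
    plt.ylabel("WSS (within-cluster sum of squares)")
    plt.show()
    ```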

    Good luck -- Jacob

    (Sorry for the long lecture)
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited January 2021 Solution Accepted
    Normally k-means is guaranteed to converge very quickly. If the process stalls, set both Max parameters back to their default values and change only k. Show what's inside your optimisation loop. Also, take only a smaller sample of your data for the optimisation.
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Solution Accepted
    It is also possible that by changing your nominals to numerical values you have generated too many dummy attributes. If one of the attributes has thousands of possible classes, either remove it or change its conversion to integers rather than dummy variables.
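    For example (a Python/pandas sketch with an arbitrary cut-off of 100 classes, assuming the raw data frame df from the earlier sketch, just to show the idea): dummy-code only the low-cardinality nominals and integer-code the rest.

    ```python
    import pandas as pd

    nominal_cols = df.select_dtypes(include="object").columns
    for col in nominal_cols:
        if df[col].nunique() > 100:
            # Too many classes for dummy variables: use integer codes instead
            df[col] = df[col].astype("category").cat.codes
        else:
            df = pd.get_dummies(df, columns=[col])
    ```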

Answers

  • GeorgeOik1999 Member Posts: 10 Contributor II
    Thanks! I see things clearly now. However, the parameter optimization doesn't run; it gets stuck at 25% of the process, and I only calculate the DB values and k values. Some other processes with the RapidMiner sample data (Titanic) work fine.
    Thanks in advance.