Choosing the best number of clusters
Hi
I have this chart for finding the best number of clusters, based on the Davies-Bouldin index and the k-means algorithm. There is no local minimum in this chart. Should I choose 7 clusters, and why? What should we do when there is no local minimum?
Best Answer
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
With high dimensional data, it can be hard to know what the "best" number of clusters is and visual inspection of the data usually does not work. Unless you have an a priori preference for a specific number, you often will look for the tradeoffs between adding additional clusters and the marginal improvement in some global fitness metric (like the DB index), which is often referred to as the "elbow method" of cluster selection, as described here: https://en.wikipedia.org/wiki/Elbow_method_(clustering)
Based on that logic, I would probably select k=7 from your results, since the benefit of adding clusters beyond that point is minimal (there is a clear bend, that is, a marked change in slope, at that point in the graph).
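To make the elbow idea concrete, here is a minimal sketch in Python with scikit-learn (swapped in for the RapidMiner process, which is not shown in this thread). The synthetic blob data, the helper names `db_index_by_k` and `pick_elbow`, and the 5% tolerance are all assumptions for illustration. The rule is deliberately crude: it returns the first k after which one more cluster improves the D-B index by less than the tolerance, so it only makes sense when the curve is steadily decreasing, as in the chart described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def db_index_by_k(X, k_values, random_state=0):
    """Fit k-means for each k and return {k: Davies-Bouldin index} (lower is better)."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return scores


def pick_elbow(scores, tolerance=0.05):
    """Hypothetical elbow rule: first k whose successor improves the index
    by less than `tolerance` (relative). Falls back to the largest k tested."""
    ks = sorted(scores)
    for k, k_next in zip(ks, ks[1:]):
        relative_gain = (scores[k] - scores[k_next]) / scores[k]
        if relative_gain < tolerance:
            return k
    return ks[-1]


# Synthetic data, only so the sketch runs end to end (not the poster's data).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)

scores = db_index_by_k(X, range(2, 11))
for k, score in scores.items():
    print(f"k={k}: D-B index = {score:.3f}")
print("elbow candidate:", pick_elbow(scores))
```

The tolerance is a judgment call, just like reading the bend off the chart by eye, so it is worth checking the suggested k against the plot rather than trusting the number alone.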
Answers
Hi @shiva1,
A first step could be an exploratory data analysis, to see visually how many clusters there are (go to the Charts panel, where you can plot your data).
A second approach is to use the DBSCAN operator (another clustering method), which does not need the number of clusters k as an input parameter.
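As a rough illustration of the second approach, here is a sketch using scikit-learn's `DBSCAN` (an assumption on my part, standing in for the RapidMiner DBSCAN operator). The two-moons data and the `eps` and `min_samples` values are arbitrary choices for the example. Note that no k is passed in: the number of clusters comes out of the density settings, which still have to be tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic 2-D data, only for illustration.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps and min_samples are assumed values; they replace k as the knobs to tune.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1, so exclude that label when counting clusters.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
```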
I hope these first suggestions are useful.
Regards,
Lionel
Hi @shiva1,
To estimate the right value of k, we can use the Bayesian Information Criterion (BIC).
I tested an algorithm based on this criterion on the well-known "Iris" dataset, which contains 3 classes.
The algorithm concluded that the right number of clusters was 3, so I think it can be relevant.
I suggest you share your dataset, so that this algorithm can be run on it
to get more information.
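The post does not say which BIC-based algorithm was used, so the following is only a stand-in: a sketch that fits Gaussian mixture models with scikit-learn and picks the number of components with the lowest BIC (in scikit-learn's convention, lower is better). The range of k and the `covariance_type` setting are assumptions, and the selected k on Iris can change with those settings, so it will not necessarily reproduce the result of 3 reported above.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

# Fit one Gaussian mixture per candidate k and record its BIC.
bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="full",  # assumed setting; other types can change the result
        n_init=5,
        random_state=0,
    ).fit(X)
    bic[k] = gmm.bic(X)

# Lower BIC means a better trade-off between fit and model complexity.
best_k = min(bic, key=bic.get)

for k, value in bic.items():
    print(f"k={k}: BIC = {value:.1f}")
print("k with lowest BIC:", best_k)
```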
Regards, and happy new year 2018!
Lionel
Hi @lionelderkrikor
Thanks,
but I have text data, and DBSCAN is not a good choice for text mining because it usually returns only one cluster.
Hello. Excuse me, I have a question that has been on my mind.
If I choose the maximization option in the performance by distance operator,
then, according to the chart in the first post,
is k = 3 the best value?
In other words, is a higher DB value better?
Thank you for answering my questions.
Hi @student_compute
"clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm" -Wikipedia.
The Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin.
My attached process runs an optimization to pick the best k for a k-means model; it finds that k=3 has the lowest D-B index. You can also try X-Means to get an optimized clustering.
The D-B index is multiplied by -1 internally so that it can be maximized, so you can ignore the negative sign in the performance output.
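The sign flip is easy to check with a few lines of plain Python. The D-B values below are invented purely for illustration (they are not taken from the chart or from the attached process); the point is only that maximizing the negated index selects the same k as minimizing the index itself.

```python
# Hypothetical D-B index per k, made up for this example.
db_index = {2: 1.42, 3: 0.87, 4: 1.05, 5: 1.10}

# An optimizer that can only maximize is handed the negated index instead.
negated = {k: -value for k, value in db_index.items()}

best_by_max = max(negated, key=negated.get)    # maximize -DB
best_by_min = min(db_index, key=db_index.get)  # minimize DB

# Both criteria pick the same k; only the sign of the reported score differs.
print("k from maximizing -DB:", best_by_max)
print("k from minimizing DB:", best_by_min)
print("reported score:", negated[best_by_max], "-> actual D-B index:", -negated[best_by_max])
```

So with the maximization option, the "highest" value is the one closest to zero, which still corresponds to the smallest D-B index.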
Why is DBSCAN not a good option for text data?