"K-means: Finding optimum # of clusters with Davies-Bouldin index"
I am clustering text from a discussion forum using k-means. I have followed the sample process called "09_KMeansWithPlot" (thanks Ingo!) to determine the optimum number of clusters via the following measures: (W) Avg Within Cluster Distance and (DB) Davies-Bouldin Index.
My understanding is that the DB index "is a function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between cluster (i.e. intercluster) scatter. A good value for the number of clusters is associated to lower values of this index."
That being said I am having trouble interpreting my results...
- Why are some of my DB values negative infinity?
- Some of my DB graphs have a gentle negative slope. How do I know where the optimum number of clusters is when there appears to be no "elbow" in the trend line?
- Why do some of the charts only plot a certain number of clusters? For example, the x-axis shows 2, 12, 22, etc. instead of all the clusters (1, 2, 3, ..., 22, etc.).
- Are there any rules of thumb I should keep in mind when using the DB index against text data?
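For reference, the quoted definition can be written out as a short calculation. The following is a minimal plain-Python sketch of the textbook Davies-Bouldin index, not the RapidMiner implementation; the function names `centroid` and `davies_bouldin` and the toy data are illustrative assumptions, and the code assumes at least two clusters, no empty clusters, and distinct centroids.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index for a list of clusters (lists of vectors).

    Assumes >= 2 clusters, every cluster non-empty, and distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: average distance of its points to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # For cluster i, take the worst (largest) ratio of combined
        # within-cluster scatter to between-centroid distance.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k

# Two toy clusterings: same cluster shapes, different separation.
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [2.0, 1.0]]]
print(davies_bouldin(tight))
print(davies_bouldin(loose))
```

The well-separated clustering yields the smaller value, which matches the "lower is better" reading in the quote. Note that the scatter term divides by the cluster size, so a cluster with no points has no defined scatter.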
Answers
Just a quick guess from your description: is it possible that some of your clusters are empty? That could explain why you have infinity values, and also why some clusters are not shown.
Best regards,
Marius
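One way to test the empty-cluster guess outside RapidMiner is to export the cluster assignments and count how many examples fall into each cluster. This is only a sketch under assumptions: the labels are taken to be integers from 0 to k-1, and `cluster_sizes` and `empty_clusters` are hypothetical helper names, not RapidMiner operators.

```python
from collections import Counter

def cluster_sizes(labels, k):
    """Number of examples assigned to each of the k clusters (0..k-1)."""
    counts = Counter(labels)
    return [counts.get(i, 0) for i in range(k)]

def empty_clusters(labels, k):
    """Indices of clusters that received no examples."""
    return [i for i, size in enumerate(cluster_sizes(labels, k)) if size == 0]

# Made-up assignments for k = 3: nothing was assigned to cluster 1.
labels = [0, 0, 2, 2, 2, 0]
print(cluster_sizes(labels, 3))
print(empty_clusters(labels, 3))
```

An empty cluster has no centroid and no average within-cluster distance, so any measure that averages over cluster members is undefined for it. That would be consistent with infinite or missing values in the plot.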
Thanks for the suggestion. I read one of your previous posts (Avoid empty clusters in Cluster Model) to see if I can identify whether I have any empty clusters. As in the previous post, I too am generating prototypes by looping through the k-means parameters.
I can't seem to correctly insert the operators used in your prior post (the "Declare Missing Value" and "Filter" operators) to see if I have any empty clusters. I have attached a copy of my process for you to look at.
What is the accepted approach when dealing with empty clusters? Simply remove them? By removing empty clusters, should I expect to see a complete DB graph (the reason for developing this RapidMiner process in the first place)?
Thanks for the advice Marius!
Paul
When referring to another post, the easiest way to make it retrievable by others is to post a link to the topic.
I can reproduce the problem, but for reference to the topic you mention please post a link here.
Btw, if you are dealing with natural language (e.g. English), you should consider adding a Stemming operator to your document processing.
Regarding the Davies-Bouldin index, I created an internal ticket requesting a discussion of how to deal with empty clusters. Until that is fixed, you have to work around it as described in the other topic, which I currently can't find.
Best regards,
Marius
My apologies for not including the link I was referring to in the previous post: http://rapid-i.com/rapidforum/index.php/topic,5689.msg20111.html#msg20111
...And thanks for submitting the internal ticket!
I took your advice and added the stemming operator to my process but I still end up with empty clusters.
Can you recommend any other clustering operators that deal with the empty cluster issue that won't throw off a Davies-Bouldin plot (or similar type of plot for selecting an optimum number of clusters)?
Thanks again, Marius!
Concerning the empty clusters, the solution provided in the other thread does not work in your case, since you are not interested in the prototypes themselves but want to calculate the performance with the Performance operator. I am not sure whether any of the other clustering implementations in RapidMiner can guarantee non-empty clusters; just give them a try. They are found in the same operator group as k-Means.
Best regards,
Marius
It turns out that k-Means won't help me determine the optimum number of clusters because of the empty clusters it produces... which is still useful information.
Do you know of a way/process to optimize the number of clusters with DBSCAN using the "epsilon" and "min points" parameters? I could not find a looping operator I can use with DBSCAN like you can with k-Means.
Paul
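The idea of looping over "epsilon" and "min points" can be sketched generically as a grid sweep that records how many clusters and noise points each setting produces. The code below is a plain-Python illustration with a deliberately minimal DBSCAN, not a RapidMiner process; `dbscan`, `sweep`, and the toy points are assumptions made up for this example.

```python
import math
from itertools import product

NOISE = -1

def dbscan(points, eps, min_points):
    """Minimal DBSCAN; returns one label per point, NOISE (-1) for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices within eps of point i (includes i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_points:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels

def sweep(points, eps_values, min_points_values):
    """One row (eps, min_points, n_clusters, n_noise) per parameter pair."""
    rows = []
    for eps, min_points in product(eps_values, min_points_values):
        labels = dbscan(points, eps, min_points)
        n_clusters = len({label for label in labels if label != NOISE})
        rows.append((eps, min_points, n_clusters, labels.count(NOISE)))
    return rows

# Toy data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
for row in sweep(points, [1.5, 20.0], [2, 3]):
    print(row)
```

On this toy data the small epsilon separates the two groups and the large epsilon merges them, with the isolated point left as noise in both cases. Each row could then be scored with a validity measure of choice, provided the noise points are handled explicitly.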
You might find these posts interesting.
http://rapidminernotes.blogspot.com/search/label/ClusterValidity
The first one uses DBSCAN, and the one labelled IV uses a clustering result as a classifier.
regards
Andrew