Cluster Distance Performance Davies Bouldin fails on certain data sets
Hi,
The Davies Bouldin index calculation fails and returns infinite numbers. This happens if you have cluster centroids that are not assigned any values. This must happen, if your data set has less rows than your cluster model has centroids. But it also happens by chance that k-Means (strangely) delivers cluster centroids that are not used at all. (I would like that not to happen, too. Probably bad initialization.)
The reason is in Line 250 of CetroidBasedEvaluator:
// averaging by cluster sizes and sum over all
for (i = 0; i < numberOfClusters; i++) {
withinClusterDistance[i] /= clusterSizes[i];
}
clusterSizes is then 0, if no example has been assigned, causing the distance to become NaN. There is simply needed a check! Not dividing in this situation would be fine.
Greetings,
Sebastian
Comments
Hi @land,
shouldn't be the expected result for a cluster with no associated examples a missing value?
BR,
Martin
Dortmund, Germany
Hi Martin,
this is a performance measure and the situation is comparable with a classification, where one class is not used and your accuracy becomes infinity.
I would also argue, that the average intra cluster distance is rather 0 than missing, but one can debate that. However, the Davis Bouldin Index should still be calculatable.
So: No, this is just a plain bug, I'm pretty sure
Greetings,
Sebastian
Hi @land,
i think i disagree for the construction of a single DB value. I mean it's something like
where i denotes an example. This is simply not defined if cluster ins a null set. If a function value is not defined, that it should be coded as NaN.
Another story is, how do we average 10 DB values where one is a NaN. I think there we do something "odd". NaNs should be exlcuded in the averages. The only question is if you divide by 10 or 9.
Dortmund, Germany
Hi @mschmitz,
there's no Davies Bouldin per example. The examples of a cluster are only used for the intra cluster distance. The value per cluster, which is later averaged to the final Davies Bouldin, is computed as maximum over all combinations with other clusters of the ratio of the intra cluster distance, to the inter cluster distance.
Hence the problem is that an entire cluster is never used (meaning k-Means implementation has a problem with initialization). That causes the intra cluster distance to become NaN as it is 0/0. But then the max check fails. If a plain 0 intra cluster distance would be returned, the cluster would usualy not have an impact as it will unlikely being the max.
Hence it would be fine to not change the average at the end at all.
Greetings,
Sebastian
Hi,
so the problem is the distance to the other clusters. Still fishy. The other cluster does not really exist. Maybe one should just add the distance to the cluster center and not the distance to the points? I just fear that adding not counting (i.e adding a 0) is a strong bias to the measure.
BR,
Martin
Dortmund, Germany