Cluster Distance Performance Davies Bouldin fails on certain data sets

land · October 2018

Hi,

The Davies Bouldin index calculation fails and returns infinite numbers. This happens if you have cluster centroids that are not assigned any values. This must happen, if your data set has less rows than your cluster model has centroids. But it also happens by chance that k-Means (strangely) delivers cluster centroids that are not used at all. (I would like that not to happen, too. Probably bad initialization.)

The reason is in Line 250 of CetroidBasedEvaluator:

		// averaging by cluster sizes and sum over all
		for (i = 0; i < numberOfClusters; i++) {
			withinClusterDistance[i] /= clusterSizes[i];
		}

clusterSizes is then 0, if no example has been assigned, causing the distance to become NaN. There is simply needed a check! Not dividing in this situation would be fine.

Greetings,

Sebastian

MartinLiebig · October 2018

Hi @land,

shouldn't be the expected result for a cluster with no associated examples a missing value?

BR,

Martin

sgenzer · October 2018

land · October 2018

Hi Martin,

this is a performance measure and the situation is comparable with a classification, where one class is not used and your accuracy becomes infinity.

I would also argue, that the average intra cluster distance is rather 0 than missing, but one can debate that. However, the Davis Bouldin Index should still be calculatable.

So: No, this is just a plain bug, I'm pretty sure

Greetings,

Sebastian

MartinLiebig · October 2018

Hi @land,

i think i disagree for the construction of a single DB value. I mean it's something like

\sum_i db(i) for each i \elem cluster

where i denotes an example. This is simply not defined if cluster ins a null set. If a function value is not defined, that it should be coded as NaN.

Another story is, how do we average 10 DB values where one is a NaN. I think there we do something "odd". NaNs should be exlcuded in the averages. The only question is if you divide by 10 or 9.

land · October 2018

Hi @mschmitz,

there's no Davies Bouldin per example. The examples of a cluster are only used for the intra cluster distance. The value per cluster, which is later averaged to the final Davies Bouldin, is computed as maximum over all combinations with other clusters of the ratio of the intra cluster distance, to the inter cluster distance.

Hence the problem is that an entire cluster is never used (meaning k-Means implementation has a problem with initialization). That causes the intra cluster distance to become NaN as it is 0/0. But then the max check fails. If a plain 0 intra cluster distance would be returned, the cluster would usualy not have an impact as it will unlikely being the max.

Hence it would be fine to not change the average at the end at all.

Greetings,

Sebastian

MartinLiebig · October 2018

Hi,

so the problem is the distance to the other clusters. Still fishy. The other cluster does not really exist. Maybe one should just add the distance to the cluster center and not the distance to the points? I just fear that adding not counting (i.e adding a 0) is a strong bias to the measure.

BR,

Martin

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Cluster Distance Performance Davies Bouldin fails on certain data sets

Fixed and Released · Last Updated October 2019

Comments