Average silhouette vs sum of squares vs average within distance vs Davies-Bouldin
mariozupan
I am trying to find the optimal number of k-means clusters. I got the following values from several cluster performance operators:
Average silhouette (closer to 1 is better)
k     value
2     0.436629028918996
3     0.3082759533058591
4     0.28166001017015313
5     0.2642004909716735
6     0.2687266594105881
7     0.20684027606885227
8     0.20938717797555279
9     0.1989215746446572
10    0.2159248335388874
11    0.20862824967813512
12    0.21515776961871466
13    0.22229187379304438
Sum of squares (closer to 0 is better)
k     value
2     0.5833789973221948
3     0.37635053401019425
5     0.2637793240351113
6     0.22072765997043042
7     0.1775519277095977
9     0.13894604067369032
10    0.13183279742765275
11    0.11787512536057321
12    0.12043920141437744
14    0.11111340867029912
15    0.0978794677474385
Davies-Bouldin (closer to 0 is better)
k     value
2     0.9380429190179411
3     1.2019021137767585
5     1.223643902662405
6     1.133405289202767
7     1.0968281280723653
9     1.1200633376736615
10    1.1979846345568537
11    1.1630894077266136
12    1.2048150524976373
14    1.120210017075379
15    1.1432560808642207
Average within distance (closer to 0 is better)
k     value
2     0.06534949797998725
3     0.05185423744778253
5     0.03845893742628533
6     0.03339595659274747
7     0.02958406174889975
9     0.02536301492397515
10    0.02418196109649237
11    0.022728641391481907
12    0.0218420365992699
14    0.019696264589330038
15    0.01864628535658701
None of my performance operators gives a good result for my data. I tried removing outliers, taking the logarithm of the attributes, and normalizing them to the range 0 to 1, and got the following results for 5 clusters:
attributes cluster1 cluster2 cluster3 cluster4 cluster5
X222 0.832614470761885 0.6164551892773821 0.6682251804332917 0.5019367377913034 0.6709872198085056
X333 0.4813816731397629 0.8084517968969477 0.4073744166141768 0.4418416403356408 0.5815675749379681
X444 0.7072093106534784 0.6221056454535794 0.17922575220116604 0.10192647980428186 0.278179549313975
X111 0.7444156633161193 0.755888014090719 0.6086095238148184 0.3923249690067086 0.7476506411572069
How can I improve the performance? Would specific results of a Shapiro-Wilk test, ANOVA, or a t-test give me better k-means clusters?
Could you please point me in the right direction? I really need help.
Answers
What if I remove the examples with negative silhouette values, as I read somewhere?
Removing outliers is certainly a good idea, and for k-means normalization is a must. I usually go for the Z-transformation (see the Normalize operator). The tests, of course, only measure the performance; they don't influence the result of the clustering.
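For readers who want to see what a Z-transformation does outside of RapidMiner, here is a minimal Python sketch. The function name `z_transform` and the sample data are made up for illustration, and the use of the population standard deviation is an assumption; the Normalize operator may divide by the sample standard deviation instead.

```python
from statistics import mean, pstdev

def z_transform(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Assumption: population standard deviation; a tool may use the
    # sample standard deviation, which gives slightly different values.
    stds = [pstdev(c) for c in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical data: the second attribute is on a much larger scale.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
scaled = z_transform(data)
```

After the transformation both attributes contribute on the same scale to the distances that k-means minimizes, so no single attribute dominates just because of its units.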
You could experiment with different distance measures in k-Means, sometimes they have quite an impact on the results.
Best, Marius
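To illustrate why the distance measure can matter, here is a small Python sketch of the assignment step only (not a full k-means). The centroids and the point are hypothetical values chosen so that Euclidean and Manhattan distance disagree; the helper names are made up for this example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(point, centroids, distance):
    """Index of the centroid closest to point under the given distance."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

# Hypothetical centroids and point, picked so the two measures disagree.
centroids = [(0.0, 0.0), (2.0, 2.2)]
point = (3.0, 0.0)

by_euclidean = nearest(point, centroids, euclidean)  # distances 3.0 vs about 2.42
by_manhattan = nearest(point, centroids, manhattan)  # distances 3.0 vs 3.2
```

The same point lands in a different cluster depending on the measure, and since every iteration repeats this step, the final clusters and their performance values can shift noticeably.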
Could you explain why the k-means operator in RapidMiner and k-means in R give me such different average silhouette values? To repeat: both use the same dataset, preprocessed in RapidMiner.
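One thing that may be worth checking when two tools disagree is which distance the silhouette is computed with. The Python sketch below is only an illustration under that assumption, not a diagnosis of what RapidMiner or R actually do: it computes the average silhouette for the same hypothetical points and labels, once with Euclidean and once with squared Euclidean distance. All names and data are made up for the example, and it expects at least two clusters.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_silhouette(points, labels, distance):
    """Mean of s(i) = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster's members, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(distance(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: 0 by the usual convention
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical points with a fixed labeling, so only the distance changes.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

sil_euclid = average_silhouette(points, labels, euclidean)
sil_squared = average_silhouette(points, labels, squared_euclidean)
```

With identical clusters the two values still differ, so comparing the distance measure, the normalization, and the cluster assignments themselves between the two tools is a reasonable first step before comparing the silhouette numbers.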