K-means cluster with text data
joen841030
Member Posts: 8 Contributor I
Hello experts!
I'd like to do k-means clustering on text data. My data is saved in one Excel file, with a single column and one word in each cell. I'm not sure whether I am doing it correctly (picture attached), because the output looks like the following, with cluster 3 containing 4889 items:
Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997
Also, is it possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!
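For reference, the silhouette score mentioned in the question can be sketched in a few lines of plain Python. This is only an illustration of the standard definition (mean of (b - a) / max(a, b) over all points, which always falls between -1 and +1); the function names `euclidean` and `silhouette_score` and the toy points are made up here, and this is not a RapidMiner operator.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (always in [-1, +1])."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes 0.
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]   # matches the real groups
bad_labels = [0, 1, 0, 1]    # mixes the groups

print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

One would typically run the clustering for several values of k and prefer the k with the highest mean silhouette.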
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @joen841030,
No, the average within centroid distance of cluster i (avg. within centroid distance_cluster_i) is not limited to values between -1 and +1.
It is a measure of distance (for example, the Euclidean distance for numeric attributes) between the points of cluster i and the centroid of cluster i, so this value quantifies how "compact" or "dense" a cluster is. In general, this metric lies between 0 and +infinity, but in RapidMiner it lies between -infinity and 0: the metric is multiplied by minus one because RapidMiner tries to maximize it.
Here is a resource about the average within-cluster distance:
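The explanation above can be illustrated with a minimal Python sketch. It assumes plain Euclidean distance, as in the example given in the answer; RapidMiner's exact computation may differ in detail, and the function names and toy points here are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def avg_within_centroid_distance(points, labels, negate=True):
    """Return (overall, per_cluster) average distance to the cluster centroid.

    With negate=True the values are multiplied by -1, mimicking the sign
    convention described above (larger, i.e. closer to 0, is better).
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    sign = -1.0 if negate else 1.0
    per_cluster = {}
    total = 0.0
    for label, members in clusters.items():
        c = centroid(members)
        dists = [math.dist(p, c) for p in members]
        total += sum(dists)
        per_cluster[label] = sign * sum(dists) / len(dists)
    return sign * total / len(points), per_cluster


# Hypothetical toy data: cluster 0 is tighter than cluster 1.
points = [(0, 0), (0, 2), (10, 10), (10, 14)]
labels = [0, 0, 1, 1]

overall, per_cluster = avg_within_centroid_distance(points, labels)
print(overall)      # -1.5
print(per_cluster)  # {0: -1.0, 1: -2.0}
```

The values are distances, so their magnitude depends on the scale and number of attributes; a value such as -385.889 is therefore not a sign of an error in itself.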
https://rapidminernotes.blogspot.com/2011/04/how-average-within-cluster-distance-is.html
Hope this helps,
Regards,
Lionel
Answers
Here you can find a method for determining the optimal number of clusters k, based on calculating the average within centroid distance as a function of k (the number of clusters):
https://community.rapidminer.com/discussion/comment/61654#Comment_61654
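As a rough illustration of the idea (not the RapidMiner process from the linked comment), the following plain-Python sketch runs a basic k-means for several values of k and records the average distance to the assigned centroid. The helper names `kmeans`, `avg_distance_to_centroid` and `scan_k`, the fixed seed, and the toy data are all assumptions made for this example.

```python
import math
import random


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=50, seed=0):
    """Very basic k-means (Lloyd's algorithm) with random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign(points, centroids), centroids


def avg_distance_to_centroid(points, labels, centroids):
    """Average Euclidean distance from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)


def scan_k(points, k_values, seed=0):
    """Map each k to the average within-centroid distance of its clustering."""
    result = {}
    for k in k_values:
        labels, centroids = kmeans(points, k, seed=seed)
        result[k] = avg_distance_to_centroid(points, labels, centroids)
    return result


# Hypothetical toy data: three small, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
for k, value in scan_k(points, [1, 2, 3, 4]).items():
    print(k, round(value, 3))
```

The distance generally shrinks as k grows; a common heuristic is to pick the k after which the improvement flattens out (the "elbow").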
Hope this helps,
Regards,
Lionel
Thanks for the reply! Hmm... but now I get the results below, and they don't look correct to me...
PerformanceVector:
Avg. within centroid distance: -385.889
Avg. within centroid distance_cluster_0: -393.196
Avg. within centroid distance_cluster_1: -351.386
Avg. within centroid distance_cluster_2: -410.075
Avg. within centroid distance_cluster_3: -384.852
Avg. within centroid distance_cluster_4: -403.787
Avg. within centroid distance_cluster_5: -371.171
Avg. within centroid distance_cluster_6: -366.001
Avg. within centroid distance_cluster_7: -402.358
Davies Bouldin: -0.500
I have also now included "Nominal to Numerical"... am I actually doing it correctly? I was just following different online tutorials and trying to figure out how to do it...
Thanks so much in advance!
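To make the "Nominal to Numerical" step more concrete, here is a small Python sketch of what dummy (one-hot) coding does to a column of single words. It assumes dummy coding is the coding type in use; the `one_hot` helper and the example words are hypothetical.

```python
import math


def one_hot(words):
    """Dummy-code a list of words: one 0/1 column per distinct word."""
    vocab = sorted(set(words))
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for word in words:
        row = [0.0] * len(vocab)
        row[index[word]] = 1.0
        vectors.append(row)
    return vocab, vectors


# Hypothetical one-word-per-cell column.
words = ["apple", "apples", "banana", "apple"]
vocab, vectors = one_hot(words)

# Identical words coincide; any two different words are equally far apart,
# no matter how similar they look as strings.
print(math.dist(vectors[0], vectors[3]))  # apple vs apple
print(math.dist(vectors[0], vectors[1]))  # apple vs apples
print(math.dist(vectors[0], vectors[2]))  # apple vs banana
```

Under this coding, every pair of distinct words is at the same distance (the square root of 2), so a distance-based method such as k-means has no notion of "similar" words to group by.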
Why do you think that these results are incorrect?
Regards,
Lionel
Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry, I don't understand those figures... It would be nice if you could kindly explain them. Thanks!