K-means cluster with text data

joen841030 Member Posts: 8 Contributor I
edited November 2019 in Help
Hello experts! 

I'd like to run k-means clustering on text data. My data is saved in one Excel file with a single column and one word in each cell. I'm not sure whether I'm doing it correctly (picture attached), because the output looks like the listing below, with cluster 3 containing 4889 of the 4997 items:

Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997



Also, is it possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!
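
For readers unfamiliar with the metric mentioned in the question, here is a minimal sketch of how a mean silhouette score can be computed. It is plain Python and not RapidMiner's implementation; the function names `euclidean` and `silhouette_score` are made up for illustration. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster; the point's score is (b - a) / max(a, b), which always lies between -1 and +1.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:  # identical points would give 0/0; treat that as 0
            total += (b - a) / denom
    return total / len(points)
```

To pick a number of clusters with this, one would run the clustering for several values of k and prefer the k with the highest mean score.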

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @joen841030,

    You can find a method for determining the optimal number of clusters k here. It is based on how the average within-centroid distance changes with k:

    https://community.rapidminer.com/discussion/comment/61654#Comment_61654

    Hope this helps,

    Regards,

    Lionel
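
The idea in the reply above (compute the average within-centroid distance for several values of k and compare) can be sketched as follows. This is a hedged illustration in plain Python, not the linked RapidMiner process: the toy `POINTS`, the helper names, and the simple k-means loop are all assumptions, and the metric here is the plain average Euclidean distance from each point to its assigned centroid, which may differ in detail from RapidMiner's own definition.

```python
import math
import random

# Hypothetical toy data: three small, well-separated groups of 2-D points.
POINTS = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=50, seed=0):
    """Basic Lloyd's algorithm with random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(v) / len(members) for v in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def avg_within_centroid_distance(points, centroids, labels):
    """Average distance from each point to its own cluster centroid."""
    return sum(dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)


def scores_by_k(points, ks):
    """Map each candidate k to its average within-centroid distance."""
    result = {}
    for k in ks:
        centroids, labels = kmeans(points, k)
        result[k] = avg_within_centroid_distance(points, centroids, labels)
    return result
```

Calling `scores_by_k(POINTS, range(1, 6))` gives one value per k. The value generally shrinks as k grows, so one typically looks for the k after which the improvement levels off (the "elbow") rather than for the smallest value.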
  • joen841030 Member Posts: 8 Contributor I
    Hi @lionelderkrikor,
    Thanks for the reply! Hmm... but now I get the results below, and they don't look correct to me:

    PerformanceVector:
    Avg. within centroid distance: -385.889
    Avg. within centroid distance_cluster_0: -393.196
    Avg. within centroid distance_cluster_1: -351.386
    Avg. within centroid distance_cluster_2: -410.075
    Avg. within centroid distance_cluster_3: -384.852
    Avg. within centroid distance_cluster_4: -403.787
    Avg. within centroid distance_cluster_5: -371.171
    Avg. within centroid distance_cluster_6: -366.001
    Avg. within centroid distance_cluster_7: -402.358
    Davies Bouldin: -0.500

    I have now also included "Nominal to Numerical". Am I actually doing it correctly? I was just following different online tutorials and trying to figure out how to do it...

    Thanks so much in advance!
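
As a side note on the "Nominal to Numerical" step mentioned above, here is a small sketch of what happens to a single column of words if it is converted with dummy (one-hot) coding. Whether the poster's process uses that coding is an assumption, and `one_hot` and `euclidean` are illustrative names, not RapidMiner functions. Each distinct word becomes its own 0/1 column, so any two different words end up exactly the same distance apart (the square root of 2), however similar they look as text.

```python
import math


def one_hot(words):
    """Encode each word as a 0/1 vector with one position per distinct word."""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for w in words:
        v = [0] * len(vocab)
        v[index[w]] = 1
        vectors.append(v)
    return vocab, vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical sample column: one word per row, as in the question.
words = ["apple", "apples", "banana", "apple"]
vocab, vectors = one_hot(words)

# Identical words coincide; all distinct words are equally far apart.
same = euclidean(vectors[0], vectors[3])        # "apple" vs "apple"
similar = euclidean(vectors[0], vectors[1])     # "apple" vs "apples"
unrelated = euclidean(vectors[0], vectors[2])   # "apple" vs "banana"
```

With such an encoding, k-means has no notion of one word being closer in meaning or spelling to another, which may be one reason the cluster sizes come out so uneven.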




  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @joen841030,

    Why do you think that these results are incorrect?

    Regards,

    Lionel
  • joen841030 Member Posts: 8 Contributor I
    Hi @lionelderkrikor,
    Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry, I don't understand those figures... It would be nice if you could kindly explain them. Thanks!