K-Means Clustering for Text

svtorykh Member Posts: 35 Maven
edited November 2018 in Help

Hi RM Team! I have a quick question about applying K-Means clustering to text.

 

I have a set of ~2000 comments. Once I'm done with Text Processing (using TF-IDF), I have a word vector matrix of ~30 terms.

 

I then apply the K-Means operator, but I wonder what actually serves as the input for clustering. Is it the word vector matrix? If so, does the clustering algorithm use the TF-IDF values from the word vectors, or some other values?

 

Best Answer

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Exactly, it is the word vector matrix that is used. So if you created the vectors using TF-IDF, the clustering will use those TF-IDF values. You also have the option of creating the vectors with other methods, such as binary term occurrences or term frequency percentage.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
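
To make the difference between those vector creation options concrete, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: the four example comments are made up, the function names are hypothetical, and the TF-IDF formula is one common textbook variant (relative term frequency times log(N/df)), so the exact numbers RapidMiner produces may differ. The point is only that each option yields a different numeric row per document, and that row is what K-Means receives.

```python
import math
from collections import Counter

# Hypothetical example comments (not from the original thread).
docs = [
    "great product fast delivery",
    "fast delivery great price",
    "poor support slow refund",
    "slow refund poor service",
]

tokenized = [d.split() for d in docs]
# One column per distinct term, in a fixed (sorted) order.
vocab = sorted({t for doc in tokenized for t in doc})


def binary_occurrence(tokens):
    """1.0 if the term appears in the document, else 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocab]


def term_frequency(tokens):
    """Share of the document's tokens accounted for by each term."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) for term in vocab]


def tf_idf(tokens):
    """Textbook TF-IDF: relative term frequency * log(N / document frequency)."""
    n_docs = len(tokenized)
    weights = []
    for term, tf in zip(vocab, term_frequency(tokens)):
        df = sum(1 for doc in tokenized if term in doc)
        weights.append(tf * math.log(n_docs / df))
    return weights


# Show the three alternative rows for the first comment.
for name, fn in [("binary", binary_occurrence),
                 ("term frequency", term_frequency),
                 ("tf-idf", tf_idf)]:
    row = fn(tokenized[0])
    print(name, [round(v, 3) for v in row])
```

Whichever weighting is chosen, the clustering step sees only the resulting numbers, so switching the option changes the distances between documents and can change the clusters.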

Answers

  • svtorykh Member Posts: 35 Maven

    Thanks much!

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Your clusters will be based on the values of the word vector that remain after pruning. If you are interested in the details, you should be able to review the actual values for each cluster in the centroid table output of the k-means operator.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
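
As a rough illustration of what a centroid table contains, the sketch below runs a basic k-means (Lloyd's algorithm with Euclidean distance) on a tiny word vector matrix in plain Python. Everything here is an assumption for illustration: the terms, the weights in `X`, the `k_means` helper, and the choice of starting centroids are hypothetical, and RapidMiner's operator has its own initialization and distance settings. Each centroid is the per-term average of the rows assigned to that cluster, which is why the table is useful for seeing which terms characterize a cluster.

```python
import math

# Hypothetical pruned vocabulary and TF-IDF-like weights (one row per comment).
terms = ["delivery", "fast", "refund", "slow"]
X = [
    [0.40, 0.35, 0.00, 0.00],
    [0.30, 0.45, 0.00, 0.05],
    [0.00, 0.00, 0.50, 0.40],
    [0.05, 0.00, 0.35, 0.45],
]


def k_means(rows, init_centroids, max_iter=100):
    """Basic Lloyd's algorithm; returns (cluster label per row, centroids)."""
    centroids = [list(c) for c in init_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda j: math.dist(row, centroids[j]))
            for row in rows
        ]
        if new_assignments == assignments:
            break  # no row changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its member rows.
        for j in range(len(centroids)):
            members = [r for r, a in zip(rows, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Fixed starting centroids keep this example deterministic.
labels, centroids = k_means(X, init_centroids=[X[0], X[2]])

# Print a centroid table: one row per term, one column per cluster.
header = "".join(f"{'cluster_' + str(j):>12}" for j in range(len(centroids)))
print(f"{'term':<10}{header}")
for term, values in zip(terms, zip(*centroids)):
    print(f"{term:<10}" + "".join(f"{v:>12.3f}" for v in values))
```

Terms with a high value in one column and a low value in the others are the ones that distinguish that cluster.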