interpreting the sum of TF-IDF scores of words across documents
LindsayKelevra
Member Posts: 5 Learner I
Hi guys! After clustering a list of documents with k-means, I would like to analyze the words in each cluster (in order to correlate them with other attributes). To do this, I summed the TF-IDF values for each word, but I think this approach may be wrong. Would it be more correct to use term frequency? Thanks in advance.
Answers
Dortmund, Germany
If you want to use your word vector values directly, you should use one of the metrics that is inherently additive, such as term occurrences, which is just a raw count of terms, or term frequency, which is just the unadjusted percentage of total terms that a particular term covers.
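To make the additive idea concrete, here is a minimal sketch in plain Python (not RapidMiner). The toy documents, the cluster labels, and both function names are made up for illustration, and it assumes you pool the raw counts per cluster before turning them into shares:

```python
from collections import Counter

# Hypothetical toy data: tokenized documents and their k-means cluster labels.
docs = [
    ["cat", "dog", "cat"],
    ["dog", "fish"],
    ["stock", "market", "stock"],
    ["market", "bank"],
]
labels = [0, 0, 1, 1]


def cluster_term_occurrences(docs, labels):
    """Sum raw term counts over all documents in each cluster."""
    totals = {}
    for tokens, label in zip(docs, labels):
        totals.setdefault(label, Counter()).update(tokens)
    return totals


def cluster_term_frequency(occurrences):
    """Turn each cluster's counts into shares of that cluster's total terms."""
    result = {}
    for label, counts in occurrences.items():
        total = sum(counts.values())
        result[label] = {term: n / total for term, n in counts.items()}
    return result


occ = cluster_term_occurrences(docs, labels)
tf = cluster_term_frequency(occ)

for label in sorted(occ):
    print(label, occ[label].most_common(3))
```

Raw counts can be added across documents without distortion, which is why they suit this kind of per-cluster summary. TF-IDF values are already reweighted by how rare a term is across the whole corpus, so their sum is harder to read as "how much this word appears in this cluster".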
But I also agree with Martin that this is not the most intuitive way of trying to understand your clusters. You can use some of the methods he describes, or you can also just look at the centroid values directly (one of the outputs of the cluster operators) and find the values that are most distinct from one cluster to another (the graph visualization is helpful for this).
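If you export the centroid table, the comparison can be roughed out as follows. This is only a sketch: the centroid values and the function name `most_distinct_terms` are hypothetical, and "most distinct" is taken here to mean the largest gap between a cluster's centroid value and the mean of the other clusters, which is one possible reading among several.

```python
# Hypothetical centroid table: cluster id -> {term: centroid value}.
centroids = {
    0: {"cat": 0.50, "dog": 0.40, "market": 0.05, "stock": 0.00},
    1: {"cat": 0.02, "dog": 0.05, "market": 0.45, "stock": 0.60},
}


def most_distinct_terms(centroids, cluster, top_n=2):
    """Rank terms by how far this cluster's centroid value sits above
    the mean of the other clusters (assumes at least two clusters)."""
    others = [c for label, c in centroids.items() if label != cluster]
    gaps = {}
    for term, value in centroids[cluster].items():
        other_mean = sum(c.get(term, 0.0) for c in others) / len(others)
        gaps[term] = value - other_mean
    return sorted(gaps, key=gaps.get, reverse=True)[:top_n]


for label in sorted(centroids):
    print(label, most_distinct_terms(centroids, label))
```

The terms at the top of each list are the ones that characterize a cluster relative to the rest, which is usually more informative than the terms that are simply frequent everywhere.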
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts