"Kmeans clustering in Text data"
Hi,
After applying the string tokenizer, stopword filter and token length filter to the text data with "Binary occurrence" selected, every word becomes a numerical attribute with binary values. My question is whether we can apply KMeans clustering directly to these numerical attributes. I tried this on my data and got meaningful clusters, but I don't know whether it is a good method for text data. Moreover, compared with KMedoids it takes much less time.
Thanks
Ratheesan.
Answers
KMeans exploits properties of the Euclidean distance to simplify the KMedoids algorithm: each cluster center is simply the mean of its members rather than a data point that has to be searched for. This speeds up the calculation but restricts the distance measure to Euclidean. Euclidean distance is usually not the best choice for high-dimensional text data; cosine similarity is the more common choice. But if you get meaningful results, everything should be fine and you can go ahead with KMeans.
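To make the difference between the two measures concrete, here is a minimal sketch in plain Python (not a RapidMiner process). The three toy documents and all function names are invented for illustration, and the whitespace tokenizer only stands in for the tokenizer and filters described in the question. It builds binary occurrence vectors, compares Euclidean distance with cosine similarity, and checks the known identity that for length-normalized vectors the squared Euclidean distance equals 2 - 2 * cosine similarity.

```python
import math

# Hypothetical toy documents, invented for illustration only.
docs = [
    "cheap flights to paris",
    "cheap flights and cheap hotels in paris with discount codes for summer travel",
    "python code for clustering text",
]

def binary_vectors(texts):
    """Binary occurrence: 1.0 if a word appears in the document, else 0.0."""
    token_sets = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*token_sets))
    return vocab, [[1.0 if w in s else 0.0 for w in vocab] for s in token_sets]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vocab, vecs = binary_vectors(docs)
short_travel, long_travel, unrelated = vecs

print("Euclidean, short vs long travel text:", euclidean(short_travel, long_travel))
print("Euclidean, short travel vs unrelated:", euclidean(short_travel, unrelated))
print("Cosine, short vs long travel text:   ", cosine_similarity(short_travel, long_travel))
print("Cosine, short travel vs unrelated:   ", cosine_similarity(short_travel, unrelated))

# On unit-length vectors, squared Euclidean distance and cosine similarity
# carry the same information: ||a - b||^2 = 2 - 2 * cos(a, b).
a, b = l2_normalize(short_travel), l2_normalize(long_travel)
print("Squared Euclidean on unit vectors:", euclidean(a, b) ** 2)
print("2 - 2 * cosine similarity:        ", 2 - 2 * cosine_similarity(a, b))
```

In this toy data the short travel text ends up slightly closer in Euclidean distance to the unrelated text than to the longer travel text, simply because the longer text contains many extra words, while cosine similarity ranks the two travel texts as more similar. The identity at the end suggests one possible workaround, assuming your setup lets you normalize each document vector to unit length before clustering: KMeans on normalized vectors then behaves much like a cosine-based clustering.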
Greetings,
Sebastian