Data Mining and KMeans
Greetings!
I'm new to data mining, and I'm currently interested in learning K-Means. I've got some questions for you guys.
My sample dataset consists of 49 records, each having 60 attributes/values.
I want to learn how the computation and assignment for the means/centroids is done.
I would also like to ask if my operators for this clustering algorithm are correct:
Root
|__AccessSampleSource (I chose this one because my database format is MS Access 2003)
|__MissingValueReplenishment (set to zero)
|__KMeans
For the visualization, I always choose Scatter Multiple, with the cluster on the x-axis and some of the attributes (usually 15 attributes) on the y-axis.
Am I doing it right?
I hope someone could enlighten me soon!
Thank you, and more power to RapidMiner!
Answers
To understand how K-Means works, I would suggest reading the respective Wikipedia entry: http://en.wikipedia.org/wiki/Kmeans
Anyway, your process setup seems quite suitable for this setting, but I would change the missing value replenishment method to use the mean value. This way the replaced values will differ least from the other values during distance calculation. Otherwise, examples with missing values could end up assigned to a single cluster just because they have missing values.
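To illustrate the point about distances, here is a minimal Python sketch (not RapidMiner code). The tiny two-attribute data set and the helper names `column_means`, `fill_missing`, and `euclidean` are made up for this example; `None` stands for a missing value.

```python
import math

def column_means(rows):
    """Mean of each attribute, ignoring missing values (None)."""
    means = []
    for j in range(len(rows[0])):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return means

def fill_missing(rows, fill_values):
    """Replace None in column j with fill_values[j]."""
    return [[fill_values[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def euclidean(a, b):
    """Euclidean distance over all attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: the second record is missing its second attribute.
rows = [
    [10.0, 200.0],
    [12.0, None],
    [11.0, 210.0],
    [9.0, 190.0],
]

means = column_means(rows)                    # [10.5, 200.0]
zero_filled = fill_missing(rows, [0.0, 0.0])  # missing value becomes 0
mean_filled = fill_missing(rows, means)       # missing value becomes 200.0

# Distance from the incomplete record to the first record.
dist_zero = euclidean(zero_filled[1], zero_filled[0])  # roughly 200
dist_mean = euclidean(mean_filled[1], mean_filled[0])  # 2.0
print(dist_zero, dist_mean)
```

With zero replacement the incomplete record looks far away from every complete record, so it tends to be grouped with other incomplete records. With mean replacement the missing attribute contributes almost nothing to the distance.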
If you find this visualization helpful, go on, but I guess using two attributes for the x and y axes and using the color for the cluster assignment is much more intuitive.
As a general hint, I would suggest upgrading to RapidMiner 5, which has a lot more power.
Greetings,
Sebastian
With regard to K-Means, most examples on websites have only two attributes, so it's easy to visualize and follow how the process is done (from the selection of centroids to the grouping of records).
I really want to simulate the K-Means process manually with my own data set, but I don't know how (or where) to start because of its 60 attributes. That alone leaves me confused. How would I evaluate this kind of data set?
P.S.
Thanks for the answers on my first post, Sebastian.
edit:
Finally found a website whose example has multiple attributes.
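For reference, the manual procedure is the same with many attributes as with two: the distance simply sums over every attribute, and each new centroid is the per-attribute mean of its members. Below is a minimal Python sketch of one assignment step and one update step. It assumes Euclidean distance and uses a made-up data set with four attributes (60 would work the same way); the function names and the choice of starting centroids are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance over all attributes, however many there are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(records, centroids):
    """For every record, return the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: euclidean(r, centroids[c]))
            for r in records]

def update(records, labels, old_centroids):
    """Recompute each centroid as the per-attribute mean of its members."""
    new_centroids = []
    for c, old in enumerate(old_centroids):
        members = [r for r, label in zip(records, labels) if label == c]
        if not members:
            # Empty cluster: keep the previous centroid (one possible choice).
            new_centroids.append(list(old))
        else:
            new_centroids.append([sum(col) / len(members)
                                  for col in zip(*members)])
    return new_centroids

# Hypothetical data: 4 records with 4 attributes each.
records = [
    [1.0, 2.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 1.0],
    [9.0, 8.0, 9.0, 10.0],
    [8.0, 9.0, 10.0, 9.0],
]

# Start from two of the records as centroids (chosen by hand here).
centroids = [records[0], records[2]]

labels = assign(records, centroids)             # [0, 0, 1, 1]
centroids = update(records, labels, centroids)  # means of each group
print(labels)
print(centroids)
```

Repeating `assign` and `update` until the labels stop changing gives the usual K-Means loop.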
Is the SVD Reduction operator necessary for every k-Means process?
In RapidMiner's k-Means algorithm, are the centroids randomly selected per iteration?
Of course SVDReduction is not obligatory for clustering! It's just a method for reducing the dimensionality of your data set. There are many other methods for this, like PCA, ICA, and so on, but you don't need one at all. It might be useful, depending on your data, but it might also hurt. It will be extremely hard to draw conclusions from a clustering of reduced data, because none of your original attributes are left.
To your second question:
Yes, RapidMiner initializes a KMeans run with random centroids. To avoid having them lie outside the boundaries of the data, one example is chosen as the centroid for each cluster.
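As a rough illustration of this kind of initialization (a sketch of the idea only, not RapidMiner's actual code), the following Python snippet picks k distinct examples at random and uses copies of them as the starting centroids. The function name `init_centroids`, the `seed` parameter, and the data are assumptions made for this example. In the standard K-Means procedure, randomness enters only at this starting step; the later assignment and update steps are deterministic.

```python
import random

def init_centroids(records, k, seed=None):
    """Pick k distinct examples at random and copy them as initial centroids.

    Because every centroid is an actual example, it cannot start
    outside the range of the data.
    """
    rng = random.Random(seed)  # seed only makes the choice reproducible
    return [list(r) for r in rng.sample(records, k)]

# Hypothetical data: 4 records with 4 attributes each.
records = [
    [1.0, 2.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 1.0],
    [9.0, 8.0, 9.0, 10.0],
    [8.0, 9.0, 10.0, 9.0],
]

start = init_centroids(records, k=2, seed=0)
print(start)
```

Different seeds can lead to different final clusters, so it is common to run K-Means several times and compare the results.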
Greetings,
Sebastian