How to compare the performance of several clustering algorithms?
Dear All,
How to compare the performance of several clustering algorithms?
Weka provides a validation method called "classes to cluster evaluation".
This method basically does classification through clustering, which is nice when your dataset contains a "class" attribute.
But what if the datasets you benchmark on don't contain any nominal attributes?
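For concreteness, the classes-to-cluster idea can be sketched as follows (a minimal sketch, not Weka's actual code: each cluster is mapped to its majority class, and that mapping is scored like a classifier):

```python
from collections import Counter

def classes_to_cluster_accuracy(cluster_ids, true_labels):
    """Map each cluster to its majority class, then score accuracy."""
    majority = {}
    for cid in set(cluster_ids):
        members = [lab for c, lab in zip(cluster_ids, true_labels) if c == cid]
        majority[cid] = Counter(members).most_common(1)[0][0]
    hits = sum(majority[c] == lab for c, lab in zip(cluster_ids, true_labels))
    return hits / len(true_labels)

# Toy example: two clusters, mostly aligned with two classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
print(classes_to_cluster_accuracy(clusters, labels))  # 4/6 ~ 0.667
```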
To me, the natural solution is to measure missing-value replacement accuracy.
So split the data into a training set and a test set, remove a random attribute value from each sample in the test set,
and try to predict back these removed attribute values.
Does anyone know a paper which uses this approach?
Is there some other approach which is typically used?
Best regards,
Wessel
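The evaluation scheme described in the question could be sketched like this (a hypothetical setup on synthetic numeric data, using scikit-learn's KMeans; predicting the hidden value from the nearest centroid is one possible way to "predict back", not a standard from the literature):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with two loose groups so clustering has some structure.
X = rng.normal(size=(200, 4))
X[:100] += 3.0
rng.shuffle(X)
train, test = X[:150], X[150:]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

errors = []
for row in test:
    j = rng.integers(len(row))  # hide one random attribute value
    masked = row.copy()
    masked[j] = np.nan
    # Assign to the nearest centroid using only the observed attributes.
    obs = ~np.isnan(masked)
    d = ((km.cluster_centers_[:, obs] - masked[obs]) ** 2).sum(axis=1)
    c = int(np.argmin(d))
    # Predict the hidden value from that centroid.
    errors.append(abs(km.cluster_centers_[c, j] - row[j]))

print("mean absolute reconstruction error:", np.mean(errors))
```

Lower reconstruction error suggests the clusters summarize the data better, which makes this usable as a label-free benchmark across clustering algorithms.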
Answers
Well, I'm facing a similar problem right now, and I think we're pointing at the same thing, so I'll give you my opinion:
When I build two clustering models, I want to know which is best, but it depends mainly on the business problem; e.g., sometimes you would use 3-4 clusters to explain the general behaviour of your clients to a Marketing Manager.
But if you pick one model, you like the segments, and you want to test its performance to be sure the segmentation is representative for further clustering (e.g., clustering next month's data), then I think (and here is my answer/question for others) the validation would be to compare the distributions of each variable between the training and test data sets. So, if the distributions are similar for each variable, you can assert that the clustering model has captured the pattern.
We can do some testing and share the results,
Best regards,
Pablo.
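Pablo's distribution check could look something like this (a sketch, assuming numeric variables; the two-sample Kolmogorov-Smirnov test from SciPy is one common way to compare two samples of the same variable, and the data here is made up):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Hypothetical: the same variable observed in this month's and next month's batch.
train_col = rng.normal(loc=0.0, scale=1.0, size=500)
test_col = rng.normal(loc=0.0, scale=1.0, size=500)

stat, p = ks_2samp(train_col, test_col)
# A large p-value means we cannot reject "same distribution",
# i.e. the variable looks stable across the two batches.
print(f"KS statistic={stat:.3f}, p-value={p:.3f}")
```

Running this per variable (and per cluster, if desired) gives a table of stability scores rather than a single number.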
If an algorithm returns a very similar distribution on the training and test set, then the algorithm performs well?
But how does a clustering algorithm return distributions?
It returns clusters.
And the clusters on the training set will be different from the clusters on the test set.
Best regards,
Wessel
Well, I know that R/RM, for example, creates a new clustering model each time you run it, but I'm testing another piece of software, Powerhouse Analytics (1), and in it, each time you create a clustering model, it stores the "weights" that produce the segmentation, so you can compare the training set vs. the test set.
Now, with respect to distributions and comparisons between clusters, I was referring to the possibility of comparing the distribution of one variable in the training and test sets.
If the clustering model changes every time it executes, what would be the point of testing it? What would be the measures for comparing performance?
I think comparing variable distributions (or interquartile ranges) between clusters would be one (of many) answers, e.g.:
Cluster 1 has an interquartile range between 20 and 30.
Cluster 2 has an interquartile range between 35 and 55.
In that case, I could say that the variable Age is very discriminative. On the other hand, if you have overlapping ranges, that variable is not very discriminative.
Thank you,
Best regards,
Pablo.
(1) Unfortunately, this web page is in Spanish, but you can set the software language to English: http://www.dataxplore.com.ar/tecnologia.php
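The interquartile-range comparison Pablo describes can be sketched directly (toy "Age" values chosen to mirror his example; `iqr_overlap` is a hypothetical helper, not part of any library):

```python
import numpy as np

def iqr(values):
    """25th and 75th percentiles of a 1-D sample."""
    return np.percentile(values, 25), np.percentile(values, 75)

def iqr_overlap(a, b):
    """True if the interquartile ranges of two samples overlap."""
    (a_lo, a_hi), (b_lo, b_hi) = iqr(a), iqr(b)
    return bool(a_lo <= b_hi and b_lo <= a_hi)

# Toy "Age" values per cluster, mirroring the ranges above.
cluster1_age = np.array([20, 22, 25, 27, 30])
cluster2_age = np.array([35, 40, 45, 50, 55])
print(iqr_overlap(cluster1_age, cluster2_age))  # False -> Age is discriminative
```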
You can store cluster assignments, cluster models, and cluster weights in RapidMiner as well.
I think what you are suggesting about quartile ranges is related to the measures:
- within-cluster similarity
- between-cluster distance
Best regards,
Wessel
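Both of those measures are folded into standard internal validity indices, e.g. in scikit-learn (a sketch on synthetic, well-separated blobs; the silhouette score combines within-cluster similarity and between-cluster separation into one number, and Davies-Bouldin is a related ratio):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(2)
# Two well-separated blobs, so internal indices should look good.
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette is in [-1, 1]; higher is better.
print("silhouette:", silhouette_score(X, labels))
# Davies-Bouldin is a within-scatter / between-separation ratio; lower is better.
print("davies-bouldin:", davies_bouldin_score(X, labels))
```

Because these indices need no class attribute, they can rank several clustering algorithms on the same unlabelled benchmark data.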