"universal clustering validation"
IngoRM
Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Original messages from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2036214&forum_id=390413
Hi,
I have tried the validation->clustering operators in many ways. For example, ClusterCentroidEvaluator
works only for k-means (it needs a centroid-based learner), and almost all the others are for
hierarchical clustering. As for the density estimator, I have not found a learner which
produces a FlatClusterModel. Anyway, the best measure is the supervised kind: comparing the
clustered data with the labeled input data (as for classification). The core of the problem is
that these validations can't be applied to two different unsupervised learners.
A supervised estimate of the error would give the best result, but can I do
that? I know it is possible in principle, but can it be done in RapidMiner? Is there some kind of
validation in RapidMiner that is applicable to every learner model and gives a good measure of clustering
validity? Or has somebody already implemented it in Java? I have read the manual and I can't find any information about that.
Thanks for your replies
Answer by Ingo Mierswa:
Hello,
for supervised cluster evaluation you have at least two options:
1) compare the cluster names with a set of predefined labels.
One could of course ask why one should cluster data which is already labelled.
--> there is currently no such operator available in RapidMiner and you would have to implement something like this yourself. We have also added this operator to our todo list.
For a small number of clusters, there is a workaround using only existing operators without any coding. You could find out which cluster number corresponds best to which label and use the AttributeValueMapper to map the cluster number to the corresponding label. Then, change the role of the cluster attribute to prediction by using the ChangeAttributeRole operator and use one of the performance evaluation operators to calculate the performance. The single operator mentioned above could do that automatically; in particular, the search for the best cluster / label mapping becomes cumbersome for larger numbers of clusters.
2) use a cross validation on a supervised learning scheme with the cluster as label and see how well it can be learned.
There is a lot of discussion about this evaluation method elsewhere (which I will not start here) but at least this can easily be done with the existing operators.
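Option 2) can be sketched outside RapidMiner as follows. This is only a minimal illustration in Python, not a RapidMiner process: the 1-nearest-neighbour learner, the function names, and the toy data are all assumptions chosen for brevity. The cluster ids are treated as labels, and a k-fold cross validation measures how well a supervised learner can reproduce them.

```python
import math

def nearest_neighbor_predict(train_x, train_y, point):
    """Return the label of the training point closest to `point` (1-NN)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))
    return train_y[best]

def cross_validated_accuracy(features, cluster_ids, folds=5):
    """k-fold cross validation with the cluster id used as the label."""
    n = len(features)
    correct = 0
    for fold in range(folds):
        # Every `folds`-th example goes into the test set of this fold.
        test_idx = [i for i in range(n) if i % folds == fold]
        train_idx = [i for i in range(n) if i % folds != fold]
        train_x = [features[i] for i in train_idx]
        train_y = [cluster_ids[i] for i in train_idx]
        for i in test_idx:
            if nearest_neighbor_predict(train_x, train_y, features[i]) == cluster_ids[i]:
                correct += 1
    return correct / n

# Hypothetical toy data: two well-separated groups and the ids a clusterer might assign.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0), (5.3, 5.2)]
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
accuracy = cross_validated_accuracy(features, cluster_ids)
print(accuracy)
```

A high accuracy here only says that the cluster assignment is easy to learn from the features; whether that makes it a good clustering is exactly the point under discussion.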
Cheers,
Ingo
Answer by topic starter:
So, for example, I have the iris data with the labels iris-setosa, versicolor, and virginica, and I applied k-means, which gives me cluster ids such as 2, 0, 1 (in no particular order). Maybe cluster 2 mistakenly mixes setosa and versicolor, and cluster 0 contains only half of versicolor. If this can be done the way you described, what do I need to enter into the operators you mentioned?
Edit by topic starter:
The best solution would be if you could send some XML code. I don't know how to set it up. First, I don't know what "attributes" and "replace what" mean in AttributeValueMapper. I think "replace what" could be iris-setosa, iris-virginica, iris-versicolor and "replace by" 0, 1, 2. But what does "attributes" mean? What does "name" mean in ChangeAttributeRole? Better to send an example; it will be quicker than a long explanation.
Answer by Ingo:
Hi,
here you go (although you really could try it first to find such a setup - you will learn quicker then ;-):
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_examples" value="400"/>
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="target_function" value="gaussian mixture clusters"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="4"/>
    </operator>
    <operator name="AttributeValueMapper" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster3"/>
        <parameter key="replace_what" value="0"/>
    </operator>
    <operator name="AttributeValueMapper (2)" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster2"/>
        <parameter key="replace_what" value="1"/>
    </operator>
    <operator name="AttributeValueMapper (3)" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster0"/>
        <parameter key="replace_what" value="2"/>
    </operator>
    <operator name="AttributeValueMapper (4)" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster1"/>
        <parameter key="replace_what" value="3"/>
    </operator>
    <operator name="AttributeCopy" class="AttributeCopy">
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="new_name" value="cluster_pred"/>
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="cluster_pred"/>
        <parameter key="target_role" value="prediction"/>
    </operator>
    <operator name="Performance" class="Performance">
    </operator>
</operator>
Please note that the mappings would not be necessary if we added an operator performing the search for the best mapping. The attribute copy ("attributes" are the same as "features", "variables", often "columns" in RapidMiner) is necessary since the ClusterModel depends on the cluster attribute and we are not allowed to simply change the role of the cluster attribute. Instead of this, you could also copy the complete data set with an IOMultiplier (only the view is copied, not the data) or remove the cluster model with an IOConsumer. As you can see, there are often a lot of options for achieving the same goal in RapidMiner.
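The search for the best mapping that such an operator would perform can be sketched like this. It is a brute-force illustration in Python under stated assumptions, not RapidMiner code: the function name and the toy iris-like data are hypothetical, and every cluster is forced onto a distinct label. It tries each assignment of labels to clusters and keeps the one with the highest accuracy.

```python
from itertools import permutations

def best_cluster_to_label_mapping(cluster_ids, labels):
    """Try every one-to-one assignment of labels to clusters and keep the most accurate."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than labels: no one-to-one mapping exists")
    best_mapping, best_hits = None, -1
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        # Count examples whose mapped cluster id equals the true label.
        hits = sum(mapping[c] == y for c, y in zip(cluster_ids, labels))
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    return best_mapping, best_hits / len(labels)

# Hypothetical example: cluster ids in arbitrary order, one versicolor example misplaced.
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor",
          "virginica", "virginica", "versicolor"]
mapping, accuracy = best_cluster_to_label_mapping(cluster_ids, labels)
print(mapping, accuracy)
```

The number of candidate mappings grows factorially with the number of clusters, which is why the manual search becomes cumbersome beyond a handful of clusters.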
Cheers,
Ingo
Answer by John:
Many thanks. A nice, sophisticated approach. This finally helped. Dealing with 3 clusters is no problem. So I look for the highest accuracy that I get from the Performance operator in the validation. The results seem rather poor. The best are k-medoids and k-means with 89%. I think it is very similar to the adjusted Rand criterion: it uses the same table of how many objects from one class fall into another class. Interestingly, k-medoids and k-means give better results when the data set is not normalized (with normalization both reach only 82%), while batch k-means (the simple k-means from Weka) and x-means do better on normalized data, by about 1%. So what do you think? Is it better to use normalized data or not? I used to think that normalizing the data is important. Above all, why does normalization degrade the clustering quality of k-means and k-medoids?
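For comparison with the mapped accuracy, the adjusted Rand index mentioned here can be computed from the same class-by-cluster table without any mapping. The following is a minimal Python sketch using the standard pair-counting formula; the function name and the example labelings are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings, computed from their contingency table."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two examples: the labelings trivially agree
    cells = Counter(zip(labels_a, labels_b))   # contingency table entries
    rows = Counter(labels_a)                   # row sums
    cols = Counter(labels_b)                   # column sums
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (maximum - expected)

# The index ignores how clusters are named: swapping ids does not change it.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

Unlike the mapped accuracy, this measure is corrected for chance agreement, so the two numbers are related but not interchangeable.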
Thanks for your help with the validation through that mapping.
Have a nice day
Regards, John
Answer by Anonymous:
hi.
We have to deliver a paper about unsupervised clustering, and we don't know which RapidMiner operators we can use.
Any help is important!
Thanks
Edit by Anonymous:
Sorry... I forgot to say that we used k-means and now we have to validate it, and we don't know how.
Thanks
Answer by Ingo:
Hello,
> We have to deliver a paper about unsupervised clustering, and we don't know which RapidMiner operators we can use.
You mean besides the ones discussed above? I would suggest that you first try the process specified above. There are also a lot of clustering examples in the sample directory delivered together with RapidMiner.
Cheers,
Ingo
Answer by Anonymous:
Yes. But those are all for supervised classification. We want an unsupervised learner. We found out that one validation measure is SSE; how can we compute that in RapidMiner? We tried ClusterCentroidEvaluator, but the result was DB = -0.56. What does that mean? Does anyone know?
Thanks.
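For reference, both measures asked about here can be written down in a few lines. This is a minimal Python sketch of the textbook definitions of SSE and the Davies-Bouldin (DB) index, not RapidMiner's implementation; the function names and toy points are assumptions. The DB index as usually defined cannot be negative, so a value like -0.56 may simply be the index reported with a flipped sign so that larger is better; that reading is an assumption and is not confirmed in this thread.

```python
import math

def group_by_cluster(points, cluster_ids):
    """Collect the points belonging to each cluster id."""
    groups = {}
    for point, cluster in zip(points, cluster_ids):
        groups.setdefault(cluster, []).append(point)
    return groups

def centroid(points):
    """Component-wise mean of a list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def sum_of_squared_errors(points, cluster_ids):
    """SSE: squared distance of every point to its own cluster centroid, summed."""
    total = 0.0
    for members in group_by_cluster(points, cluster_ids).values():
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def davies_bouldin(points, cluster_ids):
    """Davies-Bouldin index (lower is better); assumes distinct cluster centroids."""
    groups = group_by_cluster(points, cluster_ids)
    if len(groups) < 2:
        raise ValueError("the Davies-Bouldin index needs at least two clusters")
    centers = {c: centroid(m) for c, m in groups.items()}
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = {c: sum(math.dist(p, centers[c]) for p in m) / len(m)
               for c, m in groups.items()}
    worst_ratios = [
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in groups if j != i)
        for i in groups
    ]
    return sum(worst_ratios) / len(groups)

# Hypothetical toy clustering: two tight clusters that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
cluster_ids = [0, 0, 1, 1]
sse = sum_of_squared_errors(points, cluster_ids)
db = davies_bouldin(points, cluster_ids)
print(sse, db)
```

Both measures need centroids, which is why they fit centroid-based learners such as k-means but not every clustering model.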
This post is a bit old (and I'm not sure this is the right forum for my question), but I've been trying to find some references for point 2) of your reply about clustering validation and couldn't really find anything related. Could you point me to literature (papers), websites, forums... where this discussion is going on?
Thanks a lot!
damon
Hi, a question: I want to validate clusterings produced by algorithms such as DBSCAN or k-medoids. My doubt is how to validate them in RapidMiner, which seems to offer performance analysis only for the k-means and x-means algorithms. I read that validation measures from R can be added through the R extension for RapidMiner, but I don't know how. Is that possible, and how? Is the Davies-Bouldin index meant only for k-means or k-medoids? Any suggestion on how to tell whether these DBSCAN or k-medoids clustering results are good? Help please, and thanks!