Extract cluster centroids and compare with other centroids?
Hi,
I want to run k-means clustering (e.g. with k = 3...20) on two datasets, extract the centroids of the resulting clusters, and compare the centroids from dataset 1 with those from dataset 2 (e.g. by Euclidean distance). Is there a way to do that? And once I compare the centroids, how can I extract the pair of centroids, one from dataset 1 and one from dataset 2, that are closest to each other?
Best Answers
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533
RM Data Scientist
Fred,
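Check the attached process; I think this is what you want. It clusters the Sonar sample data twice with k-means (once with the default settings, once with Manhattan distance), extracts the cluster prototypes from each model, and passes both centroid sets to Cross Distances.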
~Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.4.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="85"/>
<operator activated="true" class="k_means" compatibility="7.4.000" expanded="true" height="82" name="Clustering" width="90" x="313" y="34"/>
<operator activated="true" class="extract_prototypes" compatibility="7.4.000" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="447" y="34"/>
<operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="cluster"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="replace_what" value="(.+)"/>
<parameter key="replace_by" value="Squared_$1"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.4.000" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.4.000" expanded="true" height="82" name="Clustering (2)" width="90" x="313" y="136">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="ManhattanDistance"/>
</operator>
<operator activated="true" class="extract_prototypes" compatibility="7.4.000" expanded="true" height="82" name="Extract Cluster Prototypes (2)" width="90" x="447" y="136"/>
<operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace (2)" width="90" x="581" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="cluster"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="replace_what" value="(.+)"/>
<parameter key="replace_by" value="Manhattan_$1"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.4.000" expanded="true" height="82" name="Set Role (2)" width="90" x="715" y="136">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="cross_distances" compatibility="7.4.000" expanded="true" height="103" name="Cross Distances" width="90" x="849" y="85"/>
<connect from_op="Retrieve Sonar" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Clustering (2)" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Clustering (2)" from_port="cluster model" to_op="Extract Cluster Prototypes (2)" to_port="model"/>
<connect from_op="Extract Cluster Prototypes (2)" from_port="example set" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Replace (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
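For readers who want to see the comparison step outside RapidMiner, here is a minimal Python sketch of what a cross-distance table between two centroid sets could look like. It is not the process above, only an illustration: the function name `cross_distances`, the centroid ids, and the coordinates are made up, and the centroids are assumed to be already extracted (for example by Extract Cluster Prototypes).

```python
import math
from itertools import product

def cross_distances(centroids_a, centroids_b):
    """Return (id_a, id_b, distance) for every centroid pair, closest first.

    Both arguments map a centroid id to a tuple of coordinates.
    Distances are plain Euclidean distances.
    """
    rows = [
        (id_a, id_b, math.dist(vec_a, vec_b))
        for (id_a, vec_a), (id_b, vec_b) in product(
            centroids_a.items(), centroids_b.items()
        )
    ]
    return sorted(rows, key=lambda row: row[2])

# Hypothetical centroids from two separate clustering runs (made-up values).
centroids_1 = {"ds1_cluster_0": (0.0, 0.0), "ds1_cluster_1": (5.0, 5.0)}
centroids_2 = {"ds2_cluster_0": (4.0, 6.0), "ds2_cluster_1": (1.0, 0.0)}

pairs = cross_distances(centroids_1, centroids_2)
closest = pairs[0]  # the pair of centroids with the smallest distance
print(closest)
```

Sorting the full table rather than only returning the minimum keeps the second-closest, third-closest, etc. pairs available, which is handy when k is large.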
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533
RM Data Scientist
Hi,
So you want to cluster and then check whether any cluster contains only a single label? That sounds like an Aggregate with count(label), grouped by cluster. Otherwise you might want to check the operator Map Clustering on Labels.
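As a rough illustration of the count(label) group_by(cluster) idea, the following Python sketch counts labels per cluster and lists the clusters that contain a single label only. The helper name `label_counts_per_cluster` and the rows are made up; the sketch assumes each example already carries its cluster assignment and its label.

```python
from collections import Counter, defaultdict

def label_counts_per_cluster(rows):
    """Count how often each label occurs inside each cluster.

    rows is an iterable of (cluster, label) pairs.
    """
    counts = defaultdict(Counter)
    for cluster, label in rows:
        counts[cluster][label] += 1
    return counts

# Made-up (cluster, label) pairs for illustration only.
rows = [
    ("cluster_0", "Rock"),
    ("cluster_0", "Rock"),
    ("cluster_1", "Mine"),
    ("cluster_1", "Rock"),
    ("cluster_1", "Mine"),
]

counts = label_counts_per_cluster(rows)
# A cluster is "pure" if exactly one distinct label occurs in it.
pure_clusters = sorted(c for c, per_label in counts.items() if len(per_label) == 1)
print(pure_clusters)
```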
Best,
Martin
Answers
Yes, that was pretty much what I was looking for.
Thanks.
But one more question: is it possible to cluster by labels? I mean treating each label as one cluster and then extracting or calculating the centroid of each label group. How does that work? Should I give the label the role "cluster", or something else?
Yeah, that's part of what I originally wanted to do. Is it possible to declare an example set as a cluster model? E.g. after I aggregate by class label and compute the average (centroid) of the attribute values for each class, can I declare those centroids as a cluster model?
Edit: sorry, I just noticed that this would no longer be necessary, since the centroids are turned into normal example sets afterwards anyway ;)
Is the formula for getting centroids by label class the same as the one you described above?
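On the centroid-per-label question: a centroid of a label group is just the attribute-wise mean of all examples with that label, so it can be computed with a group-by average and no clustering step at all. Below is a minimal Python sketch under that assumption; `centroids_by_label` and the example values are hypothetical and not taken from the thread.

```python
from collections import defaultdict

def centroids_by_label(examples):
    """Return {label: centroid}, the centroid being the attribute-wise mean.

    examples is an iterable of (label, vector) pairs with equal-length vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for label, vec in examples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(total / counts[label] for total in sums[label])
        for label in sums
    }

# Made-up labelled examples for illustration only.
examples = [
    ("A", (1.0, 2.0)),
    ("A", (3.0, 4.0)),
    ("B", (10.0, 0.0)),
]

label_centroids = centroids_by_label(examples)
print(label_centroids)
```

The resulting centroids are ordinary rows of numbers, so they can be compared with k-means centroids using the same distance calculation as any other example set.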