Extract cluster centroids and compare with other centroids?
Hi,
I want to run k-means clustering (e.g. with k = 3...20) on two datasets, extract the centroids of the resulting clusters, and compare the centroids from dataset 1 with those from dataset 2 (e.g. by Euclidean distance). Is there a way to do that? And when I compare the centroids, how can I extract the two centroids, one from dataset 1 and one from dataset 2, that are closest to each other?
Best Answers
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Fred,
Check the attached process. I think this is what you want.
~Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.4.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="85"/>
<operator activated="true" class="k_means" compatibility="7.4.000" expanded="true" height="82" name="Clustering" width="90" x="313" y="34"/>
<operator activated="true" class="extract_prototypes" compatibility="7.4.000" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="447" y="34"/>
<operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="cluster"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="replace_what" value="(.+)"/>
<parameter key="replace_by" value="Squared_$1"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.4.000" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.4.000" expanded="true" height="82" name="Clustering (2)" width="90" x="313" y="136">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="ManhattanDistance"/>
</operator>
<operator activated="true" class="extract_prototypes" compatibility="7.4.000" expanded="true" height="82" name="Extract Cluster Prototypes (2)" width="90" x="447" y="136"/>
<operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace (2)" width="90" x="581" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="cluster"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="replace_what" value="(.+)"/>
<parameter key="replace_by" value="Manhattan_$1"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.4.000" expanded="true" height="82" name="Set Role (2)" width="90" x="715" y="136">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="cross_distances" compatibility="7.4.000" expanded="true" height="103" name="Cross Distances" width="90" x="849" y="85"/>
<connect from_op="Retrieve Sonar" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Clustering (2)" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Clustering (2)" from_port="cluster model" to_op="Extract Cluster Prototypes (2)" to_port="model"/>
<connect from_op="Extract Cluster Prototypes (2)" from_port="example set" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Replace (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
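For readers outside RapidMiner, the idea behind the process above (extract the prototypes of two clusterings, give each one a readable ID, then compute all pairwise distances) can be sketched in plain Python. This is only an illustration, not a translation of the operators: the centroid values, the ID names, and the helper `cross_distances` are made up for the example, and Euclidean distance is used because that is what the question asks for.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_distances(request, reference):
    """Return (request_id, reference_id, distance) for every pair,
    sorted so that the closest pair comes first."""
    rows = [
        (req_id, ref_id, euclidean(req_vec, ref_vec))
        for (req_id, req_vec), (ref_id, ref_vec)
        in product(request.items(), reference.items())
    ]
    return sorted(rows, key=lambda row: row[2])


# Hypothetical centroids, as they might come out of two k-means runs.
centroids_1 = {"ds1_cluster_0": [0.0, 0.0], "ds1_cluster_1": [5.0, 5.0]}
centroids_2 = {"ds2_cluster_0": [4.0, 6.0], "ds2_cluster_1": [10.0, 0.0]}

pairs = cross_distances(centroids_1, centroids_2)
closest = pairs[0]  # the two centroids that are nearest to each other
print(closest)
```

The first row of the sorted result answers the "which two centroids are closest" part of the question; the remaining rows give the full comparison. To compare two different datasets rather than two clusterings of the same data, the two dictionaries would simply come from two separate clustering runs.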
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
So you want to cluster and then check whether any clusters contain only a single label? That sounds like an Aggregate with count(label), grouped by cluster. Otherwise, you might want to look at the operator Map Clustering On Label.
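As a rough sketch of what counting labels per cluster amounts to, here is the same idea in plain Python. The rows, cluster names, and label values are invented for illustration, and the helper names are hypothetical; inside RapidMiner the grouping would be done by the Aggregate operator instead.

```python
from collections import Counter, defaultdict


def label_counts_per_cluster(rows):
    """Count how often each label occurs inside each cluster."""
    counts = defaultdict(Counter)
    for cluster, label in rows:
        counts[cluster][label] += 1
    return counts


def pure_clusters(counts):
    """Clusters whose examples all carry the same label."""
    return sorted(c for c, per_label in counts.items() if len(per_label) == 1)


# Hypothetical (cluster, label) pairs from a clustered, labelled example set.
rows = [
    ("cluster_0", "Rock"),
    ("cluster_0", "Rock"),
    ("cluster_1", "Rock"),
    ("cluster_1", "Mine"),
    ("cluster_2", "Mine"),
]

counts = label_counts_per_cluster(rows)
pure = pure_clusters(counts)
print(pure)
```

A cluster with exactly one distinct label in its counter is "pure"; mixed clusters show up with two or more entries.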
Best,
Martin
Answers
Yes, that was pretty much what I was looking for, thanks.
One more question: is it possible to cluster by label? I mean treating each label as one cluster and then extracting or calculating the centroid of each label group. How does that work? Should I give the label the role "cluster", or something else?
Yeah, that's part of what I originally wanted to do. Is it at all possible to declare an example set as a cluster model? E.g. after I have aggregated by class label and built the average / centroid of the values for each class, can I declare those centroids as a cluster model?
Edit: sorry, I just noticed that this is no longer necessary, since the centroids are turned into a normal example set afterwards anyway.
Is the formula for getting centroids by label class the same as the one you described above?
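For reference, a per-label centroid in the sense discussed in this thread is just the attribute-wise mean of all examples that share a label. A minimal Python sketch, assuming numeric attributes and using made-up rows and a hypothetical helper name:

```python
from collections import defaultdict


def centroids_by_label(rows):
    """Average every attribute over all examples that share a label."""
    sums = {}
    counts = defaultdict(int)
    for label, vector in rows:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


# Hypothetical (label, attribute vector) pairs.
rows = [
    ("Rock", [1.0, 2.0]),
    ("Rock", [3.0, 4.0]),
    ("Mine", [10.0, 0.0]),
]

label_centroids = centroids_by_label(rows)
print(label_centroids)
```

The result is an ordinary table with one row per label, so it can be compared against k-means centroids with the same pairwise-distance approach without needing a cluster model.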