"k-means Clustering which data belongs to which cluster?"
Hi Community,
I would like to cluster countries based on several factors, such as purchasing power, competition, turnover, ease of doing business, tariffs, political stability, etc.
I am creating an input list with the aim of having a numerical value for every factor (that makes it easier to cluster).
As output I would like to have (let's say, for example) 3 clusters, and I would like to see which country belongs to which cluster.
I am currently working with the k-means operator, which works quite well, but I am not able to see which country belongs to which cluster.
Does anybody have a suggestion?
Thanks ahead.
Best regards,
Carlo
Best Answers
yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Hi @Carlo,
If you have a column for the country name or country code, you can set it to a special role (id/name). Also make sure the k-means operator adds the cluster attribute (add_cluster_attribute). The clustering operator will then return a data table with one reference column for the country name and one new column for the cluster label.
I used the ICU patient data as an example:
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve ICU Morbidity (cour. Sven Van Poucke)" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Community Samples/Community Data Sets/Medical and Health/ICU Morbidity (cour. Sven Van Poucke)"/>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="icustay_id"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="34">
        <parameter key="attribute_name" value="icustay_id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
        <description align="center" color="transparent" colored="false" width="126">icustay_id is an unique identifier for the patients</description>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="9.2.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="581" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="gender"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default" value="value"/>
        <list key="columns"/>
        <parameter key="replenishment_value" value="UNK"/>
      </operator>
      <operator activated="true" breakpoints="before" class="concurrency:k_means" compatibility="9.2.000" expanded="true" height="82" name="Clustering" width="90" x="715" y="34">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="true"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="5"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="true"/>
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Retrieve ICU Morbidity (cour. Sven Van Poucke)" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
      <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
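For readers outside RapidMiner, the same idea (keep an id column out of the distance computation, then attach the cluster label next to it) can be sketched in plain Python. This is only an illustration and not the k-means operator's implementation: the country names and feature values are made up, the function name `kmeans_with_ids` is hypothetical, and the initial centroids are simply the first k rows to keep the run deterministic, whereas real implementations use random or smarter starts.

```python
def kmeans_with_ids(rows, k, max_steps=100):
    """Cluster rows of (id, [features]); return {id: cluster label}.

    The id is carried along but never used in the distance computation,
    which mirrors giving the column a special "id" role.
    """
    points = [features for _, features in rows]
    # Assumption for determinism: start from the first k rows.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_steps):
        # Assign each point to the nearest centroid (squared Euclidean).
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return {row_id: f"cluster_{a}" for (row_id, _), a in zip(rows, assignment)}


# Hypothetical, already-normalized values: [purchasing power, political stability]
rows = [
    ("Country A", [0.90, 0.80]),
    ("Country B", [0.85, 0.90]),
    ("Country C", [0.10, 0.20]),
    ("Country D", [0.15, 0.10]),
    ("Country E", [0.50, 0.55]),
    ("Country F", [0.55, 0.45]),
]

labels = kmeans_with_ids(rows, k=3)
for country, cluster in labels.items():
    print(f"{country}\t{cluster}")
```

The printed table has one row per country with its cluster next to it, which is the shape of the "clustered set" output described above.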
YY
yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Hi @Carlo,
We can convert the region codes from nominal to dummy coding (Nominal to Numerical operator) and then multiply the region dummy codes by a factor such as 3 or 5, which changes the range of the numerical region attributes to [0,3] or [0,5]. You would also need to normalize the other columns (purchasing power, competition, turnover, ease of doing business, tariffs, political stability) so that these attributes have a smaller range, say [0,1]. A distance-based model using Chebyshev distance will then take the region factor as the most important one, since distance-based clustering models are always sensitive to normalization. This kind of human intervention increases the weight of the region factor. You would need to test different multiplication factors for region. To get guaranteed results, fitting a separate clustering model on the subset for each region would be ideal.
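The effect of the scaling can be checked with a small plain-Python sketch. It is only an illustration under assumptions: the region names, the numbers, the weight of 3 and the helper names (`min_max_normalize`, `weighted_dummy_code`, `chebyshev`) are all hypothetical and not part of any RapidMiner API.

```python
def min_max_normalize(values):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_dummy_code(categories, weight):
    """Dummy-code a nominal column, then multiply each 0/1 value by weight."""
    levels = sorted(set(categories))
    return [[weight if c == level else 0.0 for level in levels] for c in categories]


def chebyshev(a, b):
    """Chebyshev distance: the largest absolute difference over all attributes."""
    return max(abs(x - y) for x, y in zip(a, b))


REGION_WEIGHT = 3.0  # assumed factor; needs testing on real data

# Made-up example data for four countries
regions = ["EU", "EU", "ASIA", "ASIA"]
purchasing_power = [40.0, 10.0, 35.0, 12.0]
tariffs = [2.0, 8.0, 3.0, 9.0]

region_cols = weighted_dummy_code(regions, REGION_WEIGHT)
pp = min_max_normalize(purchasing_power)
tf = min_max_normalize(tariffs)
vectors = [r + [p, t] for r, p, t in zip(region_cols, pp, tf)]

same_region = chebyshev(vectors[0], vectors[1])   # both "EU"
cross_region = chebyshev(vectors[0], vectors[2])  # "EU" vs "ASIA"
print("same region:", same_region)
print("cross region:", cross_region)
```

Because every normalized attribute differs by at most 1 while two different regions always differ by the weight, any cross-region pair ends up farther apart than any same-region pair under this distance, which is why the region factor dominates the clustering.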
YY
Answers