How to evaluate which algorithm performs the best clustering?
halaalrobassy
Member Posts: 16 Contributor II
I have a dataset and I want to cluster one feature into three clusters. I chose the k-means, k-medoids, and x-means algorithms to perform this clustering, and I want to evaluate which algorithm produces the better clustering.
I put the three algorithms in a Loop Parameters operator, but I couldn't work out where to put the Cluster Distance Performance operator. I want to see the average within centroid distance and Davies-Bouldin measures for each model and, based on them, choose the model that performs the best clustering.
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Engineering_majors1" width="90" x="45" y="34">
<parameter key="repository_entry" value="../data/Engineering_majors1"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="c_cons_sum"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="normalize" compatibility="7.5.003" expanded="true" height="103" name="Normalize" width="90" x="313" y="136">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="true"/>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="c_cons_sum"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="method" value="range transformation"/>
<parameter key="min" value="0.0"/>
<parameter key="max" value="1.0"/>
<parameter key="allow_negative_values" value="false"/>
</operator>
<operator activated="true" class="loop_parameters" compatibility="6.0.003" expanded="true" height="103" name="Loop Parameters" width="90" x="447" y="136">
<list key="parameters">
<parameter key="Select Subprocess.select_which" value="[1.0;4;4;linear]"/>
</list>
<parameter key="error_handling" value="fail on error"/>
<parameter key="synchronize" value="false"/>
<process expanded="true">
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="82" name="Multiply" width="90" x="45" y="85"/>
<operator activated="true" class="select_subprocess" compatibility="9.2.001" expanded="true" height="103" name="Select Subprocess" width="90" x="246" y="85">
<parameter key="select_which" value="4"/>
<process expanded="true">
<operator activated="true" class="concurrency:k_means" compatibility="9.2.001" expanded="true" height="82" name="Clustering" width="90" x="45" y="85">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="5"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="true"/>
<parameter key="measure_types" value="BregmanDivergences"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<connect from_port="input 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="output 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="output 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="k_medoids" compatibility="7.5.003" expanded="true" height="82" name="K-Medoids" width="90" x="45" y="187">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="3"/>
<parameter key="max_runs" value="10"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="true"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="measure_types" value="MixedMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_port="input 1" to_op="K-Medoids" to_port="example set"/>
<connect from_op="K-Medoids" from_port="cluster model" to_port="output 1"/>
<connect from_op="K-Medoids" from_port="clustered set" to_port="output 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="x_means" compatibility="9.2.001" expanded="true" height="82" name="X-Means" width="90" x="45" y="34">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k_min" value="3"/>
<parameter key="k_max" value="60"/>
<parameter key="determine_good_start_values" value="true"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="clustering_algorithm" value="KMeans"/>
<parameter key="max_runs" value="10"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<connect from_port="input 1" to_op="X-Means" to_port="example set"/>
<connect from_op="X-Means" from_port="cluster model" to_port="output 1"/>
<connect from_op="X-Means" from_port="clustered set" to_port="output 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="9.2.001" expanded="true" height="103" name="Performance" width="90" x="380" y="85">
<parameter key="main_criterion" value="Avg. within centroid distance"/>
<parameter key="main_criterion_only" value="false"/>
<parameter key="normalize" value="false"/>
<parameter key="maximize" value="false"/>
</operator>
<connect from_port="input 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Subprocess" to_port="input 1"/>
<connect from_op="Select Subprocess" from_port="output 1" to_op="Performance" to_port="cluster model"/>
<connect from_op="Select Subprocess" from_port="output 2" to_op="Performance" to_port="example set"/>
<connect from_op="Performance" from_port="performance" to_port="performance"/>
<connect from_op="Performance" from_port="example set" to_port="result 1"/>
<connect from_op="Performance" from_port="cluster model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Engineering_majors1" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Loop Parameters" to_port="input 1"/>
<connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
<connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Best Answers
Telcontar120
RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Just use the Cluster Distance Performance operator after each clustering method and it will give you the metrics you seek.
Of course, the value chosen for k for any of these will have a big impact on the cluster metrics. Do you have an a priori value you are using? Or do you need to do this evaluation over a range of different possible k-values (in which case you may want to build some loops and do some logging of results).
Additionally, you should recall that k-means and x-means are actually the same algorithm, with x-means searching for the "best" value of k automatically. So they won't give you different metrics if the k-value is the same for both.
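To make the two metrics concrete, here is a minimal Python sketch (not RapidMiner code) that scores a one-dimensional clustering from the cluster labels that any of the three algorithms produced. The function names are illustrative, and the use of plain absolute distance is an assumption: RapidMiner's Cluster Distance Performance operator may use a different distance or sign convention, so compare values only within one tool.

```python
from statistics import mean


def centroids_of(values, labels):
    """Centroid (mean value) of each cluster, keyed by cluster label."""
    return {
        c: mean(v for v, l in zip(values, labels) if l == c)
        for c in set(labels)
    }


def avg_within_centroid_distance(values, labels):
    """Average distance from each point to the centroid of its own cluster."""
    cents = centroids_of(values, labels)
    return mean(abs(v - cents[l]) for v, l in zip(values, labels))


def davies_bouldin(values, labels):
    """Davies-Bouldin index: mean over clusters of the worst
    (scatter_i + scatter_j) / centroid_distance ratio.

    Needs at least two clusters with distinct centroids.
    """
    cents = centroids_of(values, labels)
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = {
        c: mean(abs(v - cents[c]) for v, l in zip(values, labels) if l == c)
        for c in cents
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
            for j in cents
            if j != i
        )
        for i in cents
    ]
    return mean(worst_ratios)


if __name__ == "__main__":
    # Made-up, already normalized values and two candidate labelings.
    values = [0.0, 0.1, 0.4, 0.5, 0.9, 1.0]
    candidates = {
        "labeling A": [0, 0, 1, 1, 2, 2],
        "labeling B": [0, 0, 0, 1, 1, 2],
    }
    for name, labels in candidates.items():
        print(
            name,
            avg_within_centroid_distance(values, labels),
            davies_bouldin(values, labels),
        )
```

Lower is better for both measures in this sketch. Running it once per algorithm, or once per candidate k, and logging the two numbers gives the kind of comparison described above.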
Telcontar120
RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
There are no major parameters for optimizing k-means other than the selection of k, so if you know that in advance then I don't think there is really anything else for you to do. The only other parameter is the distance measure, and you typically decide in advance which one suits your use case rather than "optimizing" it. If you have both numerical and nominal attributes, you will be limited to Mixed Euclidean anyway.
You should be sure to normalize your data before running any k-means or k-medoids!
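As an illustration of the normalization step, here is a minimal Python sketch of a min-max rescaling to the range 0 to 1, which corresponds in spirit to the "range transformation" method set on the Normalize operator in the process above. The function name and sample values are hypothetical, and this is not RapidMiner's implementation.

```python
def range_transform(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]


if __name__ == "__main__":
    # Made-up raw values for a single attribute.
    print(range_transform([10, 20, 30]))
```

Without a step like this, an attribute measured on a large scale dominates the Euclidean distance, which matters as soon as more than one attribute is clustered.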
Above I was referring to whether you already knew the number of clusters (k) you wanted to use. If you already know k, then you also have no need for x-means, since x-means is simply k-means searching across a range of possible k values, so I would drop that from your analysis.