Davies-Bouldin index
Learner I
Hi, I am using the Davies-Bouldin index and got -2. When I changed the attributes, I got -4.
Which one is better: -2 or -4?
Comments
Great question! The D-B index is multiplied by -1 internally so that it can be maximized. That is kind of a bug. You can ignore the negative sign in the performance output, so the clustering model with a D-B index of -2 is the better one.
"clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm" -Wikipedia
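To make the sign convention concrete, here is a minimal Python sketch using the two values from the question. The labels "attribute set A" and "attribute set B" are made up for illustration. It only shows that "drop the sign and take the smallest index" and "keep the sign and take the largest value" pick the same model.

```python
# Values reported in the question; the labels are hypothetical.
reported = {"attribute set A": -2.0, "attribute set B": -4.0}

# Option 1: ignore the sign and pick the smallest Davies-Bouldin index.
best_by_min = min(reported, key=lambda name: abs(reported[name]))

# Option 2: keep the sign and pick the largest value (the one closest to zero).
best_by_max = max(reported, key=lambda name: reported[name])

print(best_by_min, best_by_max)
```

Both rules select the model that reported -2, which is why the two pieces of advice (minimize the plain index, or maximize the negated one) do not contradict each other.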
The Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin.
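For reference, the usual textbook definition of the index for k clusters is given below. This is the standard formula, not something taken from the RapidMiner source, so the operator's implementation may differ in details such as the distance measure.

```latex
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}}
```

Here S_i is the average distance of the points in cluster i to its centroid (intra-cluster scatter) and M_{ij} is the distance between the centroids of clusters i and j (inter-cluster separation). Compact, well-separated clusters give small ratios, so a smaller DB is better.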
My attached process runs a grid optimization to pick the best k for a k-means model; it finds that k=3 has the lowest D-B index. You can also try X-Means to get an optimized clustering.
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value="yhuang@rapidminer.com"/> <parameter key="process_duration_for_mail" value="1"/> <parameter key="encoding" value="UTF-8"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Ripley-Set" width="90" x="45" y="34"> <parameter key="repository_entry" value="//Samples/data/Ripley-Set"/> </operator> <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="279" y="34"/> <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="145" name="Optimize Parameters" width="90" x="514" y="34"> <list key="parameters"> <parameter key="Clustering.k" value="[2.0;20;19;linear]"/> </list> <parameter key="error_handling" value="fail on error"/> <parameter key="log_performance" value="true"/> <parameter key="log_all_criteria" value="false"/> <parameter key="synchronize" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="fast_k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="34"> <parameter key="add_cluster_attribute" value="true"/> <parameter key="add_as_label" value="false"/> <parameter key="remove_unlabeled" value="false"/> <parameter key="k" value="2"/> <parameter key="determine_good_start_values" value="false"/> <parameter key="measure_types" value="NumericalMeasures"/> <parameter key="mixed_measure" value="MixedEuclideanDistance"/> <parameter key="nominal_measure" 
value="NominalDistance"/> <parameter key="numerical_measure" value="EuclideanDistance"/> <parameter key="divergence" value="GeneralizedIDivergence"/> <parameter key="kernel_type" value="radial"/> <parameter key="kernel_gamma" value="1.0"/> <parameter key="kernel_sigma1" value="1.0"/> <parameter key="kernel_sigma2" value="0.0"/> <parameter key="kernel_sigma3" value="2.0"/> <parameter key="kernel_degree" value="3.0"/> <parameter key="kernel_shift" value="1.0"/> <parameter key="kernel_a" value="1.0"/> <parameter key="kernel_b" value="0.0"/> <parameter key="max_runs" value="10"/> <parameter key="max_optimization_steps" value="100"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> </operator> <operator activated="true" class="cluster_distance_performance" compatibility="9.2.000" expanded="true" height="103" name="Performance" width="90" x="648" y="34"> <parameter key="main_criterion" value="Davies Bouldin"/> <parameter key="main_criterion_only" value="true"/> <parameter key="normalize" value="false"/> <parameter key="maximize" value="false"/> </operator> <connect from_port="input 1" to_op="Clustering" to_port="example set"/> <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/> <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/> <connect from_op="Performance" from_port="performance" to_port="performance"/> <connect from_op="Performance" from_port="example set" to_port="output 1"/> <connect from_op="Performance" from_port="cluster model" to_port="model"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> <description align="left" color="green" colored="true" height="173" 
resized="false" width="626" x="109" y="164">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good? Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description> </process> <description align="center" color="transparent" colored="false" width="126">figure out the best k for k-means</description> </operator> <operator activated="true" class="x_means" compatibility="9.0.000" expanded="true" height="82" name="X-Means" width="90" x="514" y="289"> <parameter key="add_cluster_attribute" value="true"/> <parameter key="add_as_label" value="false"/> <parameter key="remove_unlabeled" value="false"/> <parameter key="k_min" value="2"/> <parameter key="k_max" value="10"/> <parameter key="determine_good_start_values" value="false"/> <parameter key="measure_types" value="NumericalMeasures"/> <parameter key="mixed_measure" value="MixedEuclideanDistance"/> <parameter key="nominal_measure" value="NominalDistance"/> <parameter key="numerical_measure" value="EuclideanDistance"/> <parameter key="divergence" value="GeneralizedIDivergence"/> <parameter key="kernel_type" value="radial"/> <parameter key="kernel_gamma" value="1.0"/> <parameter key="kernel_sigma1" value="1.0"/> <parameter key="kernel_sigma2" value="0.0"/> <parameter key="kernel_sigma3" value="2.0"/> <parameter key="kernel_degree" value="3.0"/> <parameter key="kernel_shift" value="1.0"/> <parameter key="kernel_a" value="1.0"/> <parameter key="kernel_b" value="0.0"/> <parameter key="clustering_algorithm" value="KMeans"/> <parameter key="max_runs" value="10"/> 
<parameter key="max_optimization_steps" value="100"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <description align="center" color="transparent" colored="false" width="126">run x-means for an optimized clustering</description> </operator> <connect from_op="Ripley-Set" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="Optimize Parameters" to_port="input 1"/> <connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/> <connect from_op="Optimize Parameters" from_port="parameter set" to_port="result 1"/> <connect from_op="Optimize Parameters" from_port="output 1" to_port="result 2"/> <connect from_op="X-Means" from_port="clustered set" to_port="result 3"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="42"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="189"/> <portSpacing port="sink_result 4" spacing="0"/> </process> </operator> </process>
YY
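For readers without RapidMiner at hand, the idea behind the process above (try several values of k, score each clustering with the Davies-Bouldin index, keep the lowest) can be sketched in plain Python. This is an illustration under stated assumptions, not the operators' actual implementation: the toy data set, the deterministic farthest-point seeding, and all function names are made up for the example, and it does not use the Ripley-Set sample data.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then keep adding the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_steps=100):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_steps):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels

def davies_bouldin(points, centers, labels):
    """Standard Davies-Bouldin index with Euclidean distance; lower is better."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(math.dist(p, centers[j]) for p in members) / len(members) if members else 0.0)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k

# Toy data: three tight, well-separated blobs of five points each.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
blob_centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

# Sweep k and keep the plain (non-negated) index for each clustering.
scores = {}
for k in range(2, 6):
    centers, labels = k_means(points, k)
    scores[k] = davies_bouldin(points, centers, labels)

best_k = min(scores, key=scores.get)  # smallest index wins
print(best_k, scores)
```

If a tool reports the negated index instead, the same choice is obtained by taking the largest reported value.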
Thanks for the reply
But I saw comments on another post asking the same question, and they gave a different answer: we should take the minimum, and if the index is maximized (removing the multiplication by -1) we should take the greatest number. This is what confuses me.