"X-Means doubling cluster item counts as of 9.0.001"
Hello,
I want to report what seems to me to be a bug in RapidMiner 9.0.001 and 9.0.003 (the problem does not occur in version 9.0.000). When I run X-Means and examine the Cluster Model results, the item counts for each cluster and for the total appear to be doubled. I've experienced this on at least two separate datasets. I am posting a reproducible example at the bottom of this post.
In the posted process, I adapted the K-Means tutorial process in RapidMiner (which uses the built-in Iris dataset) to use X-Means instead of K-Means. I copied all the same settings from the K-Means operator provided in the tutorial. If I change the compatibility of the X-Means operator to 9.0.000, the Cluster Model reports 150 total items, as expected. But with compatibility 9.0.001 or the current 9.0.003, the reported total doubles to 300, and the cluster member counts seem to be doubled, too. This seems to me to be an obvious bug.
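For anyone who wants to check the doubling outside the Cluster Model view, here is a minimal Python sketch of the comparison I have in mind: count the cluster assignments in the clustered set and compare them with the sizes the model reports. The function names, the export of assignments as a plain list, and the numbers are all hypothetical and only for illustration; they are not the actual Iris results.

```python
from collections import Counter


def compare_cluster_counts(assignments, reported_sizes):
    """Compare counts taken from the clustered set with sizes reported by the model.

    assignments    -- iterable of cluster names, one per example (hypothetical export)
    reported_sizes -- dict of cluster name -> size shown in the Cluster Model
    Returns a dict of cluster name -> (actual, reported).
    """
    actual = Counter(assignments)
    clusters = sorted(set(actual) | set(reported_sizes))
    return {c: (actual.get(c, 0), reported_sizes.get(c, 0)) for c in clusters}


def mismatches(comparison):
    """Return only the clusters whose reported size differs from the actual count."""
    return {c: pair for c, pair in comparison.items() if pair[0] != pair[1]}


# Made-up numbers for illustration only; they are not the Iris results.
assignments = ["cluster_0"] * 60 + ["cluster_1"] * 90
reported = {"cluster_0": 120, "cluster_1": 180}

comparison = compare_cluster_counts(assignments, reported)
print(mismatches(comparison))
# Totals: actual number of examples vs. total reported by the model.
print(sum(a for a, _ in comparison.values()), sum(r for _, r in comparison.values()))
```

If the model were consistent with the clustered set, `mismatches` would return an empty dict and the two totals would be equal.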
In general, I'm pretty whacked out by whatever changes were made in X-Means as of 9.0.001. It is giving me other odd results with some other datasets (such as an infinite value for the Davies Bouldin index and "unknown" for "Avg. within centroid distance_cluster_0"). And to throw something else in, checking the option "determine good start values", which is now the default, gives crazily irrelevant clusters. I don't know if there is a common root bug causing all these things, but I want to tackle the doubling of cluster member counts first, since that is an obvious error (unless there's something I really don't understand) that I've consistently found across three distinct datasets.
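As background on the infinite Davies Bouldin value: in the textbook definition, the index divides the summed scatter of two clusters by the distance between their centroids, so it becomes infinite whenever two centroids coincide. I don't know whether that is what happens inside RapidMiner, and the sketch below is not RapidMiner's implementation; it is just a plain Python version of the standard formula (the function name and the toy points are my own) showing how such a value can arise.

```python
import math


def davies_bouldin(points, labels):
    """Textbook Davies-Bouldin index using Euclidean distance.

    points -- list of coordinate tuples
    labels -- list of cluster names, one per point
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid of each cluster: per-dimension mean of its points.
    centroids = {
        label: tuple(sum(dim) / len(members) for dim in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance of its points to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        worst = 0.0
        for j in clusters:
            if i == j:
                continue
            separation = math.dist(centroids[i], centroids[j])
            if separation == 0:
                # Coincident centroids: the ratio is treated as infinite here.
                ratio = math.inf
            else:
                ratio = (scatter[i] + scatter[j]) / separation
            worst = max(worst, ratio)
        total += worst
    return total / len(clusters)


# Two well-separated toy clusters give a finite value.
separated = davies_bouldin([(0, 0), (0, 2), (10, 0), (10, 2)], ["a", "a", "b", "b"])
# Two toy clusters that share the centroid (1, 0) give infinity.
coincident = davies_bouldin([(0, 0), (2, 0), (1, 1), (1, -1)], ["a", "a", "b", "b"])
print(separated, coincident)
```

This needs Python 3.8 or later for `math.dist`.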
I would appreciate confirmation of whether this is indeed a bug and, if not, clarification of what I might be doing wrong.
Thanks!
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
<parameter key="random_seed" value="2001"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Retrieve Iris" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
<description align="center" color="blue" colored="true" width="126">The Iris data set is retrieved from the Samples folder.<br/>The label Attribute remains in the ExampleSet for comparison the results of the Clustering. It is not used in the Clustering itself.</description>
</operator>
<operator activated="false" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" origin="GENERATED_TUTORIAL" width="90" x="246" y="136">
<parameter key="k" value="3"/>
<parameter key="determine_good_start_values" value="false"/>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<operator activated="true" class="x_means" compatibility="9.0.003" expanded="true" height="82" name="X-Means" width="90" x="246" y="34">
<parameter key="k_max" value="10"/>
<parameter key="determine_good_start_values" value="false"/>
<parameter key="measure_types" value="BregmanDivergences"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<operator activated="true" class="aggregate" compatibility="9.0.003" expanded="true" height="82" name="Aggregate" origin="GENERATED_TUTORIAL" width="90" x="447" y="85">
<list key="aggregation_attributes">
<parameter key="a1" value="count"/>
</list>
<parameter key="group_by_attributes" value="label|cluster"/>
<description align="center" color="purple" colored="true" width="126">The Aggregate Operator is used to count the number of Examples for each combination of cluster_idea and value of the label Attribute</description>
</operator>
<operator activated="true" class="order_attributes" compatibility="9.0.003" expanded="true" height="82" name="Reorder Attributes" origin="GENERATED_TUTORIAL" width="90" x="648" y="187">
<parameter key="attribute_ordering" value="cluster|label"/>
</operator>
<operator activated="true" class="sort" compatibility="9.0.003" expanded="true" height="82" name="Sort" origin="GENERATED_TUTORIAL" width="90" x="782" y="187">
<parameter key="attribute_name" value="cluster"/>
</operator>
<operator activated="true" class="rename" compatibility="9.0.003" expanded="true" height="82" name="Rename" origin="GENERATED_TUTORIAL" width="90" x="916" y="187">
<parameter key="old_name" value="count(a1)"/>
<parameter key="new_name" value="count"/>
<list key="rename_additional_attributes"/>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="X-Means" to_port="example set"/>
<connect from_op="X-Means" from_port="cluster model" to_port="result 1"/>
<connect from_op="X-Means" from_port="clustered set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
<connect from_op="Aggregate" from_port="original" to_port="result 2"/>
<connect from_op="Reorder Attributes" from_port="example set output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<description align="center" color="purple" colored="true" height="71" resized="true" width="376" x="642" y="278">The aggregated ExampleSet is postprocessed for an easier visualisation.</description>
<description align="left" color="yellow" colored="false" height="281" resized="true" width="788" x="24" y="371">Look into the results of the process:<br>ExampleSet (Rename):<br>- cluster_0 consist mainly of iris_virginica Examples (36) with only a few (3) iris_versicolor Examples<br>- cluster_1 consists completely of iris_setosa Examples (50). Also iris_setosa Example cannot be found in other clusters.<br>- cluster_2 consists most of iris_versicolor Examples (47) but with also some (14) iris_virginica Examples<br><br>ExampleSet (Clustering):<br>- You can visualize the assignment of the Examples to the clusters by using the 'Scatter' Chart, plotting two of the Attributes a1,a2,a3,a4 on x-and y-axis and the cluster Attribute as Color Column<br><br>Cluster Model (Clustering):<br>- The Cluster Model consist information which Example is assigned to which cluster<br/>- the size of the clusters can be visualized as a graph<br/>- the position of the centroids is listed</description>
<description align="center" color="green" colored="true" height="156" resized="false" width="126" x="226" y="207">The k-Means algorithm is used to determine three clusters on the Iris data set and assign each Example to one cluster.</description>
</process>
</operator>
</process>
Comments
Hi @Tripartio,
Thank you for reporting this. I checked your processes. It should not be possible for a change of compatibility to change the number of items in X-Means. I think this is connected to a fix in 9.0.2:
see: https://docs.rapidminer.com/latest/studio/releases/changes-9.0.2.html
Anyhow, I will create a ticket for this and let our dev team check it out. Thank you so much for reporting!
Best,
Martin
Dortmund, Germany
This will be fixed with the upcoming 9.1 release.
Regards,
Marco