RapidMiner 4.2 MPCK-Means random results
Hi!
I know nothing about code and I needed MPCK-Means to solve a problem, so I was happy to find an implementation of this algorithm in RapidMiner 4.2. I have 83 instances, and the dimensionality ranges from 1 to 5 (I have several sets of variables and cluster the same instances separately with each set). Among those 83 instances, I have 14 "neighborhoods" of points connected by must-link constraints, and I don't have any cannot-link constraints.
However, I just discovered that the results from this algorithm changed completely when I repeated the process some time later with the same dataset and the same parameter settings. And now I am noticing that the clusters change if I change the random seed. I thought this could come from different starting seeds, but the algorithm supposedly uses "farthest first" instead of random initialization (as described in the original paper). I tried repeating the process with 1000 initializations (the default is 5) and 100 iterations per run, and yet the results still change when I change the random number seed! With the same seed I consistently get the same clusters today, but tomorrow I may consistently get a different clustering result with the same parameters and the same fixed seed.
This is a bit scary because now I can't decide which cluster results I should trust! Why is this happening? Is it only my data? Is this the reason why the clusterer was removed in RM 4.3?
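For readers wondering how a "farthest first" initialization can still depend on the random seed: in the textbook form of farthest-first traversal, only the first center is picked at random, and every later center follows deterministically from it. The Python sketch below shows that textbook form only. It is not RapidMiner 4.2's code, the names `farthest_first_from`, `farthest_first` and `toy_points` are made up for illustration, and whether the operator chooses its starting point this way is an assumption.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_from(points, k, start):
    """Return k center indices, beginning at index `start` (hypothetical helper)."""
    centers = [start]
    # Distance from every point to its nearest already-chosen center.
    nearest = [squared_distance(p, points[start]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all chosen centers
        # (ties go to the lowest index).
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(nxt)
        nearest = [
            min(nearest[i], squared_distance(points[i], points[nxt]))
            for i in range(len(points))
        ]
    return centers


def farthest_first(points, k, seed):
    """Same traversal, but the starting point comes from a seeded generator."""
    rng = random.Random(seed)
    return farthest_first_from(points, k, rng.randrange(len(points)))


# Made-up one-dimensional data, just to show the effect of the start point.
toy_points = [(0.0,), (1.0,), (5.0,), (10.0,)]
print(farthest_first_from(toy_points, 2, start=0))  # [0, 3]
print(farthest_first_from(toy_points, 2, start=2))  # [2, 0]
```

On this toy data, starting from index 0 selects points 0 and 3, while starting from index 2 selects points 2 and 0. So a different seed can lead to different initial centers even though no step after the first one is random.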
Answers
RapidMiner 4 is *very* old, and honestly I don't know much about it, especially not about special operators and the politics of adding or removing them. Anyway, my guess about the differing results is that the operator does not consistently use the random number generator with the configured seed, but also (erroneously) relies on a system RNG. So if all the clusterings you get seem to make sense, and if the cluster performance operators (do they exist in RM 4?) deliver reasonable values, I would probably trust the outcome.
Best, Marius
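The kind of bug guessed at in the answer above can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of the RapidMiner operator: the functions `init_consistent` and `init_leaky` are hypothetical, and drawing initial centers with `sample` is an assumption made to keep the example short.

```python
import random


def init_consistent(n_points, k, seed):
    """Draw k distinct start indices using only the seeded generator."""
    rng = random.Random(seed)
    return rng.sample(range(n_points), k)


def init_leaky(n_points, k, seed):
    """Draw the first index from the seeded generator, the rest from the OS."""
    rng = random.Random(seed)
    first = rng.randrange(n_points)  # honours the configured seed
    system_rng = random.SystemRandom()  # OS entropy, cannot be seeded
    remaining = [i for i in range(n_points) if i != first]
    return [first] + system_rng.sample(remaining, k - 1)


# Same seed, two calls: the consistent version always repeats itself.
print(init_consistent(83, 3, seed=7) == init_consistent(83, 3, seed=7))  # True

# The leaky version agrees on the first index only; the rest may differ.
print(init_leaky(83, 3, seed=7))
print(init_leaky(83, 3, seed=7))
```

If an implementation mixes generators like `init_leaky` does, fixing the seed in the settings pins down only part of the initialization, which would make results hard to reproduce.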