Mini Batch K-means in RapidMiner
Hi
I have a huge dataset (4,000,000 records) of text data and I want to do clustering.
Because of memory problems and the time complexity of text pre-processing, I want to read small batches from the database and, after pre-processing, use mini-batch K-means to cluster the data. But I wonder how to use mini-batch clustering in RapidMiner.
Thanks in advance for your answers.
Answers
Hi,
there are different Loop operators in RapidMiner.
You can easily implement this batching behaviour by using a loop with a numeric counter and selecting data from your database with LIMIT n OFFSET (i - 1) * n.
Here n would be your preferred batch size and i the current iteration number, starting at 1. Usually you need to calculate the offset yourself outside of the statement, e.g. with Generate Macro. Not all databases support the LIMIT ... OFFSET syntax, but most offer the same functionality under a different name.
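As a minimal sketch of the same pattern outside RapidMiner (the table and column names are hypothetical, and SQLite stands in for your database):

    import sqlite3

    batch_size = 1000                       # n, the preferred batch size
    conn = sqlite3.connect("documents.db")  # hypothetical database file

    i = 1                                   # current iteration, starting at 1
    while True:
        offset = (i - 1) * batch_size       # what Generate Macro would compute
        rows = conn.execute(
            "SELECT text FROM documents LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:                        # no more data: stop looping
            break
        # ... pre-process and cluster this batch ...
        i += 1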
Regards,
Balázs
Hi, thanks for your answer.
The mini-batch K-means algorithm takes a small batch of the dataset in each iteration. It assigns each data point in the batch to a cluster based on the previous locations of the cluster centroids, then updates the centroid locations based on the new points from the batch.
How could I build a process like this?
The Loop operator creates new clusters for the current batch in each iteration and doesn't assign the new points to the previously found clusters.
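For reference, this incremental behaviour is what scikit-learn's MiniBatchKMeans implements via partial_fit; a minimal sketch (the random data stands in for the pre-processed batches):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    km = MiniBatchKMeans(n_clusters=10, random_state=0)

    # Each batch updates the same centroids instead of re-clustering from scratch.
    for _ in range(100):
        X_batch = np.random.rand(1000, 50)  # stand-in for one pre-processed batch
        km.partial_fit(X_batch)

    print(km.cluster_centers_.shape)        # (10, 50): the running centroids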
@BalazsBarany
Hi,
for this algorithm you'd need an operator that remembers the cluster centroids from the previous clustering, and a clustering operator that can take these as its input. Extract Cluster Prototypes does something like this for the first step, but I don't know a way to push the prototypes into a new clustering.
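Outside RapidMiner this "push the old centroids in" step is possible, e.g. in scikit-learn, where KMeans accepts an initial centroid array; a minimal sketch with made-up data:

    import numpy as np
    from sklearn.cluster import KMeans

    # First batch: ordinary clustering.
    X1 = np.random.rand(1000, 50)
    km1 = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X1)

    # Second batch: start from the previous centroids (the "prototypes")
    # instead of random initialisation, so the clusters stay comparable.
    X2 = np.random.rand(1000, 50)
    km2 = KMeans(n_clusters=10, init=km1.cluster_centers_, n_init=1).fit(X2)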
Regards,
Balázs
I was actually trying to build a cluster model that I wanted to update with new data; rather than running the whole thing again, I planned to use the centroids to update it. (Limited resources on a Hadoop cluster mean I can only cluster 1,000,000 records at a time.)
This is what I considered, which sounds similar to mini-batch. I'm about to test it, so maybe you could have a look?
The idea was to weight the centroids generated by Extract Cluster Prototypes by simply duplicating them. I figured that would bias the clustering towards those values for the centroids, but not necessarily force the clustering to accept them as final.
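A minimal sketch of that duplication idea (the data and the weight factor are made up, and scikit-learn stands in for the RapidMiner process):

    import numpy as np
    from sklearn.cluster import KMeans

    prototypes = np.random.rand(10, 50)  # stand-in for Extract Cluster Prototypes output
    X_new = np.random.rand(1000, 50)     # stand-in for the next batch of records

    weight = 50                          # how strongly the old centroids should pull
    X = np.vstack([X_new, np.repeat(prototypes, weight, axis=0)])

    km = KMeans(n_clusters=10, random_state=0).fit(X)

Note that duplicating a prototype w times is equivalent to giving it a sample weight of w, so KMeans.fit(X, sample_weight=...) would express the same bias without blowing up the example set.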