
How do I create balanced clusters?

kikikubikova Member Posts: 3 Learner I
edited July 2022 in Help
Hi guys,

I'm pretty new to the community, so sorry if my question seems elementary, but how do I create balanced clusters (k-means), meaning that each cluster contains the same number of items? Or is there a way to force the minimum cluster size to something other than 1?

(What I am trying to do is create pairs based on some variables. I have a list of villages with their population size, average age, unemployment etc., and for each village in my dataset I am looking for the village that is most similar across all of the variables, i.e. matching the most alike villages. My idea was to build N/2 clusters to create pairs, but as I don't know how to make balanced clusters or how to force the minimum size of a cluster to 2 items, the output was N/2 clusters that unfortunately did not contain 2 items each: some clusters had e.g. 3 items and some only 1.)

Thank you for any advice (the simpler the solution, the better :smiley: )!

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,
    let me ask a question first: Why do you want to cluster? What do you want to do with the similar villages?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kikikubikova Member Posts: 3 Learner I
    Hi, 

    thank you for your comment. For my master's thesis I am analyzing the effect of rainfall on voter turnout. I have turnout and precipitation data for around 600 villages, as well as some basic information like unemployment, area, population size etc. My task is to match the villages based on these parameters, finding the most suitable pairs for a diff-in-diff model (checking how the difference in turnout between the two villages changes with the difference in precipitation and other independent variables over the years).

    Do you have any idea how to fix the number of items in a cluster? Or how to increase the minimum number of items in a cluster to 2?

    Thanks,

    Kristina
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,
    so why do you think clustering would be a good approach here?
    I would either:
    - Build a predictive model to predict turnout and then look at the influence of rainfall, for example with PDP plots.
    - Search for the k nearest neighbours of each village and then compare their turnout with the turnout in this one village.

    With standard k-means you cannot control cluster sizes, other than by merging "close" clusters until they are big enough. That would be a manual process, I think. Anyway, I do not think clustering is a suitable approach here.
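
A minimal sketch of the nearest-neighbour option in plain Python, outside RapidMiner. The village rows (population, average age, unemployment) are made-up illustration values, and the helper names `standardize` and `nearest_neighbours` are hypothetical. Columns are standardized first so that population size does not dominate the distance.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def nearest_neighbours(rows, k=1):
    """For each row, return the indices of its k closest other rows."""
    z = standardize(rows)
    result = []
    for i, a in enumerate(z):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(z) if j != i)
        result.append([j for _, j in dists[:k]])
    return result

# Hypothetical villages: [population, average age, unemployment %]
villages = [
    [1200, 41.0, 6.5],
    [1250, 40.5, 6.8],
    [300, 52.0, 12.0],
    [320, 51.0, 11.5],
    [5000, 35.0, 4.0],
]
neighbours = nearest_neighbours(villages, k=1)
for i, nn in enumerate(neighbours):
    print(f"village {i} -> closest: {nn}")
```

Note that nearest neighbours are not necessarily mutual: the large village 4 is assigned to one of the mid-sized villages, which are already each other's closest match. So this gives every village a best match, but not disjoint pairs.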

    BR,
    Martin
  • kikikubikova Member Posts: 3 Learner I
    With all due respect, Martin, that was not my question. I wasn't asking for your opinion on my model and methodology; I was asking how to create balanced clusters using RapidMiner. For your information, you are able to control cluster sizes using k-means (e.g. DOI:10.1007/978-3-662-44415-3_4 or simply Wikipedia "Balanced clustering"), and you can do so using for example R or Python. My question was whether it is possible, and if so how to do it, using RapidMiner.
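
For the pairing goal specifically (every group has exactly 2 items), one simple option outside RapidMiner is a greedy pairing in plain Python. This is a sketch only: it is not the balanced k-means algorithm from the reference above but a simpler heuristic, the function name `greedy_pairs` is hypothetical, and the points are made-up values assumed to be already standardized.

```python
import math
from itertools import combinations

def greedy_pairs(points):
    """Repeatedly pair the two closest points that are both still unpaired."""
    # All candidate pairs, sorted from closest to farthest.
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    paired, pairs = set(), []
    for _, i, j in candidates:
        if i not in paired and j not in paired:
            pairs.append((i, j))
            paired.update((i, j))
    return pairs

# Hypothetical, already standardized feature vectors for six villages.
points = [
    [0.0, 0.0], [0.1, 0.0],
    [2.0, 2.0], [2.1, 2.1],
    [5.0, 5.0], [5.0, 5.3],
]
pairs = greedy_pairs(points)
print(pairs)
```

Every village ends up in exactly one pair (with an odd number of villages, one is left over). The greedy choice does not guarantee the smallest total distance over all pairs; a minimum-weight matching would, at a higher computational cost.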

    Anyway, thank you for your advice!

    BR,

    Kristina
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Then: Balanced K-Means is not available in RM natively.