Assumptions of categorical variables for k-means clustering?
I've been given a dataset for an exercise in k-means clustering with 5 variables. Three of them are continuous (customer age, number of items per transaction, and dollar value of the transaction). The other two are not: one is binominal (in-store or online transaction, coded 1 or 0) and the other is polynominal ('Region', with values of 1, 2, 3 or 4), although both are currently stored in the dataset as integers.
Am I correct in assuming that I should exclude transaction type and region? My logic is that the centroids produced would be more or less garbage, given that a transaction can't be halfway between online and in-store. Similarly with geographical regions: an average of the region codes is meaningless (a centroid 'Region' of 2.5 doesn't correspond to any actual region).
Thanks in advance for any and all assistance. I've spent the last day and a half researching online and am none the wiser (with any certainty).
Answers
RapidMiner provides the "mixed Euclidean" distance option, which uses Euclidean distance for the numerical attributes and a 0/1 mismatch distance for the nominal ones (0 if two values are equal, 1 if they differ). As long as the numericals have been normalized into a [0,1] interval, this removes any bias from scaling.
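To make that concrete, here is a minimal Python sketch of such a mixed distance. It is only an illustration of the idea, not RapidMiner's actual implementation, and the exact way the two parts are combined (inside the square root) is my assumption:

```python
import numpy as np

def mixed_euclidean(x_num, y_num, x_nom, y_nom):
    # Squared differences for the [0,1]-normalized numerical attributes.
    num_part = np.sum((np.asarray(x_num) - np.asarray(y_num)) ** 2)
    # 0/1 mismatch distance for each nominal attribute.
    nom_part = sum(1 for a, b in zip(x_nom, y_nom) if a != b)
    return float(np.sqrt(num_part + nom_part))

# Two customers: (age, items, dollars) normalized to [0,1],
# plus (channel, region) as nominal values.
print(mixed_euclidean([0.2, 0.5, 0.1], [0.3, 0.5, 0.4],
                      ["online", "R1"], ["store", "R1"]))
# sqrt(0.01 + 0.00 + 0.09 + 1 + 0) = ~1.05
```

Because every attribute contributes at most 1 to the sum under the square root, no single attribute can dominate the distance.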
Regarding the interpretability of nominal distances, this is a bit tricky of course, but it's really no worse than the same conceptual problem of averaging a binominal 0/1 attribute. The resulting number tells you something but it doesn't really correspond to a single observation.
I would actually stay away from the Nominal to Numerical conversion (dummy coding) for your nominal attributes if you are going to do any clustering. It multiplies each nominal attribute into as many new attributes as it has distinct values, and each of these becomes a 0/1 attribute that goes into the overall distance calculation. This ultimately gives the nominal attributes with more possible values higher weight in the mixed Euclidean distance than those with fewer values (and all of the nominals are then more heavily weighted as a group than your numerical attributes, which still contribute one attribute each), which is not typically a desirable outcome.
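You can see the weighting effect with a quick simulation. A single one-hot mismatch contributes (1-0)^2 + (0-1)^2 = 2 to the squared distance, and mismatches become more frequent as the number of distinct values grows. This sketch assumes uniformly distributed values, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Average contribution of each attribute type to the squared
# distance between two randomly drawn observations.

# A numerical attribute normalized to [0,1]:
x, y = rng.random(n), rng.random(n)
print("numerical [0,1]:", np.mean((x - y) ** 2))    # ~0.17

# A 2-value nominal, dummy coded (each mismatch counts 2):
a, b = rng.integers(0, 2, n), rng.integers(0, 2, n)
print("2-value nominal:", np.mean(2.0 * (a != b)))  # ~1.0

# A 4-value nominal like Region, dummy coded:
a, b = rng.integers(0, 4, n), rng.integers(0, 4, n)
print("4-value nominal:", np.mean(2.0 * (a != b)))  # ~1.5
```

So on average the dummy-coded nominals swamp the numericals, and the 4-value Region outweighs the 2-value transaction type.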
If you want to constrain your cluster centers to always sit at the location of an actual observation rather than at some imaginary point, then simply use the k-Medoids operator instead. It uses the same distance measure and a very similar iterative algorithm to k-means; the only difference is that the center of each cluster is constrained to be an actual data point rather than an average of similar data points.
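As a toy illustration of the center step (not RapidMiner's code, just the idea), the medoid is simply the member of a cluster with the smallest total distance to all the other members:

```python
import numpy as np

def medoid(points, dist):
    # Pairwise distances within the cluster; the medoid is the
    # member with the smallest total distance to the others, so
    # it is always a real observation, unlike a k-means centroid,
    # which is an average.
    d = np.array([[dist(p, q) for q in points] for p in points])
    return points[int(np.argmin(d.sum(axis=1)))]

# Toy cluster of (normalized age, online flag) rows:
cluster = [np.array([0.1, 1.0]), np.array([0.2, 1.0]), np.array([0.9, 0.0])]
print(medoid(cluster, lambda p, q: float(np.linalg.norm(p - q))))
# [0.2 1. ]  -> an actual customer, with a valid 0/1 flag
```

Note that the returned center has a legitimate value for the binominal attribute, which is exactly what you were worried about with k-means centroids.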
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
However, if the analyst does PCA on the full set of attributes, it will probably end up commingling the nominal and the numerical attributes, which is horrible from an interpretability perspective. So to do it properly based on the approach you suggest, you would need to perform PCA only on the attributes created from the Nominal to Numerical conversion. That subtlety is easy to miss.
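For what that would look like, here is a rough sketch in Python with scikit-learn rather than RapidMiner; the data and column counts are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical table: 3 normalized numerical columns plus 6 dummy
# columns produced by a nominal-to-numerical conversion.
numericals = rng.random((100, 3))
dummies = np.eye(6)[rng.integers(0, 6, 100)]  # one-hot rows

# Run PCA on the dummy block only, so the components never
# commingle nominal and numerical information.
nominal_components = PCA(n_components=2).fit_transform(dummies)

# Recombine the two blocks before clustering.
X = np.hstack([numericals, nominal_components])
print(X.shape)  # (100, 5)
```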
In practice, most folks don't like cluster results built on PCA attributes anyway, since the PCA attributes are essentially unintelligible. In my experience doing clustering projects, there is still a strong desire to connect the cluster results to something tangible about the underlying data. Hence I recommend not going down the road of the PCA transformation at all (or any other synthetic form of attribute variance reduction).
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts