Assumptions of categorical variables for k-means clustering?
I've been given a dataset for an exercise in k-means clustering with 5 variables. Three of them are continuous (customer age, number of items per transaction, and dollar value of the transaction). The other two are not: one is binominal (in-store or online transaction, coded 1 or 0) and the other is polynominal ('Region', with values of 1, 2, 3 or 4), although both are currently stored in the dataset as integers.
Am I correct in assuming that I should exclude transaction type and region? My logic is that the centroids produced would be more or less garbage, given that a transaction can't be halfway between online and in-store. Similarly with geographical regions: an average of the region codes is meaningless (a centroid 'Region' of 2.5 doesn't correspond to any actual region).
Thanks in advance for any and all assistance. I've spent the last day and a half researching online and am none the wiser (with any certainty).
Answers
RapidMiner provides the "mixed Euclidean" distance option, which uses Euclidean distance for the numerical attributes and a 0/1 mismatch distance for the nominal ones (0 if two values are equal, 1 if they differ). As long as the numericals have been normalized into a [0,1] interval, this removes any bias from scaling.
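To make that concrete, here is a minimal Python sketch of such a mixed distance. It is only an illustration of the idea, not RapidMiner's actual implementation, and the exact way the two parts are combined (inside the square root) is my assumption:

```python
import numpy as np

def mixed_euclidean(x_num, y_num, x_nom, y_nom):
    # Squared differences for the [0,1]-normalized numerical attributes.
    num_part = np.sum((np.asarray(x_num) - np.asarray(y_num)) ** 2)
    # 0/1 mismatch distance for each nominal attribute.
    nom_part = sum(1 for a, b in zip(x_nom, y_nom) if a != b)
    return float(np.sqrt(num_part + nom_part))

# Two customers: (age, items, dollars) normalized to [0,1],
# plus (channel, region) as nominal values.
print(mixed_euclidean([0.2, 0.5, 0.1], [0.3, 0.5, 0.4],
                      ["online", "R1"], ["store", "R1"]))
# sqrt(0.01 + 0.00 + 0.09 + 1 + 0) = ~1.05
```

Because every attribute contributes at most 1 to the sum under the square root, no single attribute can dominate the distance.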
Regarding the interpretability of nominal distances, this is a bit tricky of course, but it's really no worse than the same conceptual problem of averaging a binominal 0/1 attribute. The resulting number tells you something but it doesn't really correspond to a single observation.
I would actually stay away from the Nominal to Numerical conversion (dummy coding) for your nominal attributes if you are going to do any clustering. It multiplies each nominal attribute into as many new attributes as it has distinct values, and each of these becomes a 0/1 attribute that goes into the overall distance calculation. This ultimately gives the nominal attributes with more possible values higher weight in the mixed Euclidean distance than those with fewer values (and all of the nominals are then more heavily weighted as a group than your numerical attributes, which still contribute one attribute each), which is not typically a desirable outcome.
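You can see the weighting effect with a quick simulation. A single one-hot mismatch contributes (1-0)^2 + (0-1)^2 = 2 to the squared distance, and mismatches become more frequent as the number of distinct values grows. This sketch assumes uniformly distributed values, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Average contribution of each attribute type to the squared
# distance between two randomly drawn observations.

# A numerical attribute normalized to [0,1]:
x, y = rng.random(n), rng.random(n)
print("numerical [0,1]:", np.mean((x - y) ** 2))    # ~0.17

# A 2-value nominal, dummy coded (each mismatch counts 2):
a, b = rng.integers(0, 2, n), rng.integers(0, 2, n)
print("2-value nominal:", np.mean(2.0 * (a != b)))  # ~1.0

# A 4-value nominal like Region, dummy coded:
a, b = rng.integers(0, 4, n), rng.integers(0, 4, n)
print("4-value nominal:", np.mean(2.0 * (a != b)))  # ~1.5
```

So on average the dummy-coded nominals swamp the numericals, and the 4-value Region outweighs the 2-value transaction type.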
If you want to constrain your cluster centers to always sit at the location of an actual observation rather than at some imaginary point, then simply use the k-Medoids operator instead. It uses the same distance measure and a very similar iterative algorithm to k-means; the only difference is that the center of each cluster is constrained to be an actual data point rather than an average of similar data points.
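As a toy illustration of the center step (not RapidMiner's code, just the idea), the medoid is simply the member of a cluster with the smallest total distance to all the other members:

```python
import numpy as np

def medoid(points, dist):
    # Pairwise distances within the cluster; the medoid is the
    # member with the smallest total distance to the others, so
    # it is always a real observation, unlike a k-means centroid,
    # which is an average.
    d = np.array([[dist(p, q) for q in points] for p in points])
    return points[int(np.argmin(d.sum(axis=1)))]

# Toy cluster of (normalized age, online flag) rows:
cluster = [np.array([0.1, 1.0]), np.array([0.2, 1.0]), np.array([0.9, 0.0])]
print(medoid(cluster, lambda p, q: float(np.linalg.norm(p - q))))
# [0.2 1. ]  -> an actual customer, with a valid 0/1 flag
```

Note that the returned center has a legitimate value for the binominal attribute, which is exactly what you were worried about with k-means centroids.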
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
However, if the analyst does PCA on the full set of attributes, it will probably end up commingling the nominal and the numerical attributes, which is horrible from an interpretability perspective. So to do it properly based on the approach you suggest, you would need to perform PCA only on the attributes created from the Nominal to Numerical conversion. That subtlety is easy to miss.
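For what that would look like, here is a rough sketch in Python with scikit-learn rather than RapidMiner; the data and column counts are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical table: 3 normalized numerical columns plus 6 dummy
# columns produced by a nominal-to-numerical conversion.
numericals = rng.random((100, 3))
dummies = np.eye(6)[rng.integers(0, 6, 100)]  # one-hot rows

# Run PCA on the dummy block only, so the components never
# commingle nominal and numerical information.
nominal_components = PCA(n_components=2).fit_transform(dummies)

# Recombine the two blocks before clustering.
X = np.hstack([numericals, nominal_components])
print(X.shape)  # (100, 5)
```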
In practice, most folks don't like cluster results built on PCA attributes anyway, since the PCA attributes are essentially unintelligible. In my experience doing clustering projects, there is still a strong desire to connect the cluster results to something tangible about the underlying data. Hence I recommend not going down the road of the PCA transformation at all (or any other synthetic form of attribute variance reduction).
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts