Should you normalize dummy coded variables in clustering?

Curious · April 2019

Can you keep them as dummies and only normalize numeric variables?

Telcontar120 · April 2019

Distance calculations will be biased if your attributes are on dramatically different scales. So, as @IngoRM says, the best solution is to bring all attributes into the same range, i.e., apply range normalization to the interval 0-1 to your numeric attributes and leave the dummies as 0/1. If you don't have extreme outliers, that will work fine.
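
As a rough illustration of that bias, here is a minimal Python sketch. The toy data and the names `income` and `is_member` are hypothetical and not from this thread; it only shows how one unscaled numeric attribute can swamp a 0/1 dummy in a Euclidean distance, and how range normalization of the numeric attribute evens things out.

```python
import math

def min_max(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def euclidean(a, b):
    """Euclidean distance between two rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: income is numeric, is_member is a 0/1 dummy.
income = [20000.0, 30000.0, 50000.0, 100000.0]
is_member = [0, 1, 0, 1]

raw_rows = list(zip(income, is_member))
scaled_rows = list(zip(min_max(income), is_member))  # dummy left untouched

# Rows 0 and 1 differ in membership and by 10,000 in income.
raw_dist = euclidean(raw_rows[0], raw_rows[1])        # dominated by income
scaled_dist = euclidean(scaled_rows[0], scaled_rows[1])  # both attributes count

print(f"raw distance:    {raw_dist:.4f}")
print(f"scaled distance: {scaled_dist:.4f}")
```

On the raw data the membership difference contributes almost nothing to the distance; after scaling, both attributes live on 0-1 and the dummy has a visible effect.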

However, I ordinarily wouldn't recommend normalizing dummy variables with the z-score method, because it is not well suited to strictly bimodal distributions (which a dummy variable is by definition).
If you have already z-score normalized your numerical attributes and you also have dummy variables, then, as long as you don't have any massive outliers, you can simply rescale the z-scores into the 0-1 range and that should also be fine.
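
To make both points concrete, here is a small sketch with made-up data (the dummy columns `balanced` and `rare` and the helper names are assumptions for illustration). It uses the population standard deviation; a tool that uses the sample standard deviation will give slightly different numbers. The first part shows that the gap between the two z-score levels of a dummy depends on how rare the 1s are, so a rare dummy ends up with more weight in a distance. The second part shows that rescaling z-scores to 0-1 gives the same result as range-normalizing the raw values, because both are linear rescalings.

```python
import statistics

def z_scores(values):
    """Standardize with the population mean and standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def gap(dummy):
    """Distance between the two z-score levels of a 0/1 dummy."""
    z = z_scores(dummy)
    return max(z) - min(z)

# Hypothetical dummies: one with 50% ones, one with only 5% ones.
balanced = [0] * 50 + [1] * 50
rare = [0] * 95 + [1] * 5

balanced_gap = gap(balanced)  # 2.0
rare_gap = gap(rare)          # roughly 4.59, so the rare dummy weighs more

# Rescaling z-scores to 0-1 matches range-normalizing the raw values.
numeric = [3.0, 7.0, 8.0, 12.0, 40.0]
rescaled_from_z = min_max(z_scores(numeric))
rescaled_from_raw = min_max(numeric)

print(balanced_gap, rare_gap)
print(rescaled_from_z)
print(rescaled_from_raw)
```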

But even leaving the z-scores shouldn't be too bad (since they are typically in the range -3 to 3) and it is certainly better than no normalization of numericals at all. You can actually test this yourself by doing different types of normalization and seeing the effect on the resulting clusters. In my experience, there is not usually a major difference in these cases.
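
If you want to run that comparison outside RapidMiner, a sketch along these lines could work. Everything here is an assumption for illustration: the toy columns `age`, `spend` and `owns_home`, the plain k-means implementation, and the pairwise agreement measure (a Rand index) used to compare two clusterings. It is not the RapidMiner operator, just a way to see whether the normalization choice changes the partition.

```python
import math
import random
import statistics

def min_max(col):
    """Rescale a column linearly onto [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def z_score(col):
    """Standardize a column with the population mean and standard deviation."""
    mean, std = statistics.fmean(col), statistics.pstdev(col)
    return [(v - mean) / std for v in col]

def kmeans(rows, k, seed=0, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # k distinct rows as starting centers
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(r, centers[c]))
                  for r in rows]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(statistics.fmean(col) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def rand_index(a, b):
    """Share of row pairs on which two labelings agree (same vs. different)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical toy data: two numeric attributes and one dummy.
age = [22, 25, 27, 45, 48, 52, 30, 60]
spend = [200.0, 220.0, 250.0, 900.0, 950.0, 1000.0, 300.0, 1100.0]
owns_home = [0, 0, 1, 1, 1, 0, 0, 1]

variants = {
    "min-max numerics, raw dummy":
        list(zip(min_max(age), min_max(spend), owns_home)),
    "z-score numerics, raw dummy":
        list(zip(z_score(age), z_score(spend), owns_home)),
    "z-score everything":
        list(zip(z_score(age), z_score(spend), z_score(owns_home))),
}

labels = {name: kmeans(rows, k=2) for name, rows in variants.items()}
baseline = labels["min-max numerics, raw dummy"]
agreement = {name: rand_index(baseline, found) for name, found in labels.items()}

for name, score in agreement.items():
    print(f"{name}: agreement with baseline = {score:.2f}")
```

An agreement of 1.0 means the two normalizations produced the same partition (cluster numbering is ignored). With real data you would also want to try several seeds, since k-means depends on its starting centers.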

If you do have significant outliers, you might consider reviewing them carefully before trying to do the clustering because they are going to be problematic no matter which approach to normalization you choose.
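
A quick sketch of why outliers hurt under either method, using made-up values (`typical` and the single outlier of 1000 are hypothetical). One extreme value squeezes all the ordinary values into a sliver of the normalized range, whether you use range normalization or z-scores.

```python
import statistics

def min_max(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Standardize with the population mean and standard deviation."""
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def spread(values):
    """Width of the interval covered by the given values."""
    return max(values) - min(values)

# Hypothetical attribute: five ordinary values, then the same plus one outlier.
typical = [10.0, 12.0, 15.0, 18.0, 20.0]
with_outlier = typical + [1000.0]

# How much of the normalized scale do the five ordinary values occupy?
mm_clean = spread(min_max(typical))
mm_outlier = spread(min_max(with_outlier)[:5])
z_clean = spread(z_scores(typical))
z_outlier = spread(z_scores(with_outlier)[:5])

print(f"min-max spread: {mm_clean:.4f} without outlier, {mm_outlier:.4f} with")
print(f"z-score spread: {z_clean:.4f} without outlier, {z_outlier:.4f} with")
```

In both cases the ordinary rows become nearly indistinguishable on that attribute, while a 0/1 dummy still spans its full range, so the dummy ends up dominating the distance.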

IngoRM · April 2019

Hi,

I would say this depends on the normalization. If you normalize the rest to the range between 0 and 1, you can keep them as is. Otherwise I would personally normalize all columns the same way (e.g. z-transformation).

Hope this helps,

Ingo

MartinLiebig · April 2019

Hi,
I usually use PCA after dummy coding to get around the problem.
Best,
Martin

Telcontar120 · April 2019

@mschmitz but doesn't that get rid of your underlying attributes as well and replace them with synthetic PCs? That's probably not a helpful feature for clustering, or at least it wouldn't be for most of the clustering projects I have worked on.

MartinLiebig · April 2019

@Telcontar120,
I later join the original data back to the clustering results and start interpreting from there.

BR,
Martin


Altair RapidMiner Community
