Should you normalize dummy coded variables in clustering?
Best Answer
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert)
The distance calculations are going to be biased if your attributes are in dramatically different ranges. So, as @IngoRM says, the best solution would be to normalize all attributes into the same range (i.e., just use range normalization on the interval 0-1 for your numerics). If you don't have extreme outliers, that would be fine.
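To make the range point concrete, here is a minimal Python sketch (not a RapidMiner process) of range normalization applied to the numeric attributes only, with the dummy attribute left at 0/1. The column names `income`, `age`, and `is_member` and all values are made up for illustration.

```python
# Hypothetical toy data: two numeric attributes on very different scales
# and one dummy (0/1) attribute.
rows = [
    {"income": 20000.0, "age": 25.0, "is_member": 0},
    {"income": 50000.0, "age": 40.0, "is_member": 1},
    {"income": 80000.0, "age": 65.0, "is_member": 0},
]
numeric_cols = ["income", "age"]
all_cols = numeric_cols + ["is_member"]


def range_normalize(rows, cols):
    """Return copies of rows with each listed column rescaled to [0, 1]."""
    out = [dict(r) for r in rows]
    for col in cols:
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in out:
            # A constant column carries no information; map it to 0.0.
            r[col] = (r[col] - lo) / span if span else 0.0
    return out


def squared_distance(a, b, cols):
    """Squared Euclidean distance between two rows over the given columns."""
    return sum((a[c] - b[c]) ** 2 for c in cols)


normalized = range_normalize(rows, numeric_cols)

# Distance between the first two rows, before and after normalization.
raw_d = squared_distance(rows[0], rows[1], all_cols)
norm_d = squared_distance(normalized[0], normalized[1], all_cols)
print(raw_d, norm_d)
```

In the raw data the income difference dominates the distance almost entirely. After range normalization, the dummy's difference of 1 is on the same footing as the numeric attributes.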
However, I ordinarily wouldn't recommend normalizing dummy variables using the z-score method because the z-score method is not well suited to exclusively bi-modal distributions (which a dummy variable is by definition).
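One way to see the issue is to z-score two dummy variables with different shares of 1s and look at the gap between the two levels. The sketch below uses only the Python standard library; the two example dummies are invented for illustration.

```python
import statistics


def z_scores(values):
    """Standardize values using the population standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def level_gap(dummy):
    """Distance between the z-scored 0 level and the z-scored 1 level."""
    z = z_scores(dummy)
    return max(z) - min(z)


# Hypothetical dummies: one balanced, one where the 1 level is rare.
balanced = [0] * 50 + [1] * 50
rare = [0] * 99 + [1] * 1

gap_balanced = level_gap(balanced)
gap_rare = level_gap(rare)
print(gap_balanced, gap_rare)
```

For a dummy with a share p of 1s, the gap works out to 1 / sqrt(p * (1 - p)), so a rare category ends up much further from the rest than a balanced one. The weight a dummy gets in the distance calculation then depends on how common the category is.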
If you have already used z-score normalization on your numerical attributes and you also have dummy variables, then as long as you don't have any massive outliers, you can also just rescale the z-scores into the 0-1 range, and that should also be fine.
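As a quick check on why rescaling the z-scores works, the following sketch (plain Python, made-up `income` values) compares range normalization of the raw values with range normalization of their z-scores. Both steps are linear rescalings, so the results should agree up to floating-point error.

```python
import math
import statistics


def z_scores(values):
    """Standardize values using the population standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def range_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric attribute.
income = [20000.0, 35000.0, 50000.0, 80000.0]

direct = range_normalize(income)
via_z = range_normalize(z_scores(income))

# True when both routes give the same 0-1 values.
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(direct, via_z))
print(same)
```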
But even leaving the z-scores shouldn't be too bad (since they are typically in the range -3 to 3) and it is certainly better than no normalization of numericals at all. You can actually test this yourself by doing different types of normalization and seeing the effect on the resulting clusters. In my experience, there is not usually a major difference in these cases.
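The comparison suggested above can be sketched outside RapidMiner as well. The example below is a deliberately tiny two-cluster k-means in plain Python with a fixed starting point (first and last rows as seeds), run once on range-normalized and once on z-scored numerics. The data, the `two_means` helper, and the seeding rule are all assumptions made for this illustration, not the RapidMiner k-Means operator.

```python
import statistics


def range_normalize(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]


def z_score(col):
    mu, sd = statistics.fmean(col), statistics.pstdev(col)
    return [(v - mu) / sd for v in col]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iters=20):
    """Two-cluster k-means seeded with the first and last points."""
    centroids = [list(points[0]), list(points[-1])]
    labels = []
    for _ in range(iters):
        new_labels = [
            min((0, 1), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [statistics.fmean(dim) for dim in zip(*members)]
    return labels


# Hypothetical data: one numeric attribute and one dummy attribute.
income = [21000.0, 23000.0, 25000.0, 78000.0, 80000.0, 82000.0]
is_member = [0, 0, 1, 1, 1, 0]

labels_range = two_means(list(zip(range_normalize(income), is_member)))
labels_z = two_means(list(zip(z_score(income), is_member)))
print(labels_range)
print(labels_z)
```

On this toy data both normalizations give the same assignment. With real data you would compare the two label lists (or a cluster-agreement measure) to see whether the choice matters.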
If you do have significant outliers, you might consider reviewing them carefully before trying to do the clustering because they are going to be problematic no matter which approach to normalization you choose.
Answers
I usually use PCA after dummy coding to get rid of the problem.
Best,
Martin
Dortmund, Germany
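A rough sketch of the PCA-after-dummy-coding idea, written with NumPy rather than RapidMiner operators, might look like the following. The attribute names, the values, and the choice to keep two components are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: one numeric attribute and one three-level nominal attribute.
age = np.array([23.0, 35.0, 41.0, 52.0, 29.0, 60.0])
colour = ["red", "blue", "green", "red", "blue", "green"]

# Dummy-code the nominal attribute: one 0/1 column per level.
levels = sorted(set(colour))
dummies = np.array([[1.0 if c == lvl else 0.0 for lvl in levels] for c in colour])

# Bring the numeric attribute into the 0-1 range before combining.
age_scaled = (age - age.min()) / (age.max() - age.min())
X = np.column_stack([age_scaled, dummies])

# PCA via SVD of the mean-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # share of variance per component

# Keep the leading components (two here, chosen arbitrarily) for clustering.
n_keep = 2
scores = Xc @ Vt[:n_keep].T
print(explained)
print(scores.shape)
```

The last component explains essentially no variance, because the dummy columns of one nominal attribute always sum to 1. The clustering would then run on `scores` instead of the mixed numeric and dummy columns.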
I later join the original data back to the clustering results and start interpreting from there.
BR,
Martin
Dortmund, Germany
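The join-back step described in the post above can be sketched in plain Python as follows, assuming each example carries an `id` that survives the clustering. The rows, the `cluster_of` mapping, and the profile fields are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical original data, before dummy coding or PCA.
original = [
    {"id": 1, "income": 21000.0, "segment": "student"},
    {"id": 2, "income": 23000.0, "segment": "student"},
    {"id": 3, "income": 78000.0, "segment": "professional"},
    {"id": 4, "income": 82000.0, "segment": "professional"},
]

# Hypothetical clustering output: example id -> cluster label.
cluster_of = {1: 0, 2: 0, 3: 1, 4: 1}

# Join the cluster label back onto the original, human-readable attributes.
joined = [dict(row, cluster=cluster_of[row["id"]]) for row in original]

# Group by cluster and build a simple profile for interpretation.
by_cluster = defaultdict(list)
for row in joined:
    by_cluster[row["cluster"]].append(row)

profile = {
    c: {
        "n": len(rs),
        "mean_income": statistics.fmean([r["income"] for r in rs]),
        "segments": sorted({r["segment"] for r in rs}),
    }
    for c, rs in by_cluster.items()
}
print(profile)
```

Interpreting clusters on the original attributes avoids having to read meaning into principal components or normalized values.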