RapidMiner Equivalent for SPSS Two Step Clustering
Hello,
I was wondering whether RapidMiner has a clustering algorithm similar to SPSS's Two Step Clustering. I used it a few times at school and it was very good at automatically clustering large datasets with mixed data types. I've felt the need for it in previous projects, and I currently have a dataset that is composed of three different sets:
- survey responses (nominal)
- user transaction/redemption amount $$ and count (real, integer)
- census data
I've tried cluster analysis with k-means, k-medoids, etc. with Mixed Euclidean Distance, but I often have to repeat the manual variable selection and the choice of k several times to get clusters that look distinct from one another. Two Step Clustering takes care of the variable selection automatically, along with a whole lot of other things in the background; I just have to clean and prepare my final dataset. Basically, I'd like to avoid going back and forth so much and save time by having an algorithm take care of finding the best clusters possible.
Thank you.
Answers
I think the X-means clustering operator will do that for you.
Thank you very much Thomas, but I get a warning that says 'X-Means cannot handle polynominal attributes'. The measure type is set to "Mixed Measures" with the "Mixed Euclidean Distance" function.
I would love to have an algorithm that will accept categorical/class/nominal data together with numerical data as inputs.
Edit: The process executes in spite of the warning, but the results are pretty much the same as K-Means. The centroids, cluster separation, and number of records in each cluster are very similar.
You can easily transform the polynominal attributes into numerical attributes using "Nominal to Numeric" (I recommend the dummy coding conversion for this purpose). Independent of that, you'll want to normalize all your data so that differences in attribute scales do not distort your distance metric. Once you do, the clustering should be more reliable. Try that and see whether it makes any difference.
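To make the two preparation steps concrete, here is a rough sketch in plain Python rather than RapidMiner. It is not what the "Nominal to Numeric" or "Normalize" operators do internally, just the same idea. The records, attribute names, and the helper functions `dummy_code` and `range_transform` are made up for illustration.

```python
# Hypothetical records: one nominal attribute and one numeric attribute.
records = [
    {"redemption_bucket": "high", "amount": 120.0},
    {"redemption_bucket": "low", "amount": 15.0},
    {"redemption_bucket": "mid", "amount": 60.0},
    {"redemption_bucket": "low", "amount": 0.0},
]


def dummy_code(rows, attribute):
    """Replace one nominal attribute with a 0/1 column per category."""
    categories = sorted({row[attribute] for row in rows})
    coded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != attribute}
        for category in categories:
            flag = 1.0 if row[attribute] == category else 0.0
            new_row[f"{attribute}={category}"] = flag
        coded.append(new_row)
    return coded


def range_transform(rows):
    """Rescale every column to the 0-1 range (min-max normalization)."""
    columns = list(rows[0].keys())
    lows = {c: min(r[c] for r in rows) for c in columns}
    highs = {c: max(r[c] for r in rows) for c in columns}
    scaled = []
    for row in rows:
        scaled.append({
            # A constant column has no spread, so map it to 0.0.
            c: (row[c] - lows[c]) / (highs[c] - lows[c]) if highs[c] > lows[c] else 0.0
            for c in columns
        })
    return scaled


prepared = range_transform(dummy_code(records, "redemption_bucket"))
for row in prepared:
    print(row)
```

After this, the dummy columns stay at 0 or 1 and the amount column is squeezed into the same 0-1 range, so no single attribute dominates the distance calculation just because of its scale.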
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you Brian, but now I cannot figure out how to interpret the cluster composition results. For example:
attribute -> Coupon Redemption Bucket
classes -> High (40 or more), mid (20-40), low (0-20)
After dummy coding, each class is an attribute of its own with binary flags (0/1). I want to be able to say the following about my clusters:
Cluster 1 mostly consists of users in the high-frequency coupon redemption bin, whereas cluster 2 is composed of low-frequency coupon redemption members, and so on for the rest of the variables. With the given result set, I'm not sure how to get to that point.
If you look at the attribute values in the centroid table (part of the output when you connect the cluster output port), you should be able to tell which attributes are associated with each cluster. If you normalized your data using a method that puts everything on a 0-1 scale (like range transformation), then for 0/1 dummy values the centroid value will basically correspond to the share of records in that cluster that have that particular category (for example, 0.75 means about 75%). You can then also sort the table to see the highest values for each cluster, or even compute differences to find the attributes that differ most between clusters.
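Here is a small sketch of that reading of the centroid table, again in plain Python rather than RapidMiner. It assumes the centroid is the per-cluster mean of each attribute, as in k-means. The cluster labels, the attribute names, and the `centroids` helper are hypothetical.

```python
# Hypothetical clustered records after dummy coding: (cluster label, 0/1 flags).
rows = [
    ("cluster_1", {"bucket=high": 1, "bucket=mid": 0, "bucket=low": 0}),
    ("cluster_1", {"bucket=high": 1, "bucket=mid": 0, "bucket=low": 0}),
    ("cluster_1", {"bucket=high": 0, "bucket=mid": 1, "bucket=low": 0}),
    ("cluster_1", {"bucket=high": 1, "bucket=mid": 0, "bucket=low": 0}),
    ("cluster_2", {"bucket=high": 0, "bucket=mid": 0, "bucket=low": 1}),
    ("cluster_2", {"bucket=high": 0, "bucket=mid": 1, "bucket=low": 0}),
    ("cluster_2", {"bucket=high": 0, "bucket=mid": 0, "bucket=low": 1}),
    ("cluster_2", {"bucket=high": 0, "bucket=mid": 0, "bucket=low": 1}),
]


def centroids(labelled_rows):
    """Average every attribute within each cluster (a k-means style centroid)."""
    sums, counts = {}, {}
    for label, values in labelled_rows:
        counts[label] = counts.get(label, 0) + 1
        totals = sums.setdefault(label, {})
        for name, value in values.items():
            totals[name] = totals.get(name, 0.0) + value
    return {
        label: {name: total / counts[label] for name, total in totals.items()}
        for label, totals in sums.items()
    }


table = centroids(rows)

# For a 0/1 dummy column, the mean is the share of records with that category.
for label, centroid in table.items():
    print(label, centroid)

# Rank attributes by how much the two centroids differ, largest gap first.
differences = sorted(
    ((name, table["cluster_1"][name] - table["cluster_2"][name])
     for name in table["cluster_1"]),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, gap in differences:
    print(name, gap)
```

In this made-up data, 3 of the 4 records in cluster_1 fall in the high bucket, so its centroid value for "bucket=high" is 0.75, which is the kind of statement you were after ("cluster 1 mostly consists of high redemption users").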