
Rapidminer Equivalent for SPSS Two Step Clustering

batstache611 Member Posts: 45 Maven
edited November 2018 in Help

Hello,

 

I was wondering if there is a clustering algorithm similar to SPSS's Two Step Clustering? I've used it a few times during school, and it was very good at auto-clustering large datasets with mixed data types. Aside from having felt the need for it in previous projects, I currently have a dataset that is composed of three different sets -

 

  • survey responses (nominal)
  • user transaction/redemption amount $$ and count (real, integer)
  • census data

I've tried cluster analysis with K-Means, K-Medoids, etc. using Mixed Euclidean Distance, but I often have to perform manual variable selection and try several values of k before the clusters look distinct from one another. Two Step Clustering takes care of the variable selection process automatically, along with a whole lot of other work in the background; I just have to clean and prepare my final dataset. Basically, I'd like to avoid going back and forth so much and save time by having an algorithm take care of finding the best clusters possible.

 

Thank you.


Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I think the X-means clustering operator will do that for you. 
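    X-Means essentially runs k-Means while letting an information criterion decide how many clusters to keep, so you don't have to fix k up front. For the idea outside RapidMiner, here is a rough Python sketch (a stand-in only, not the X-Means operator itself: it sweeps k and keeps the best silhouette score, and the data is just a random placeholder):

    ```python
    # Rough stand-in for X-Means' automatic choice of k (illustrative sketch only):
    # sweep k with k-means and keep the clustering with the best silhouette score.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))   # placeholder for your prepared numeric data

    best_k, best_score, best_model = None, -1.0, None
    for k in range(2, 11):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_k, best_score, best_model = k, score, model

    print(f"chosen k = {best_k} (silhouette = {best_score:.3f})")
    ```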

  • batstache611 Member Posts: 45 Maven

    Thank you very much, Thomas, but I get a warning that says 'X-Means cannot handle polynominal attributes'. The measure type is set to "Mixed Measures" with the "Mixed Euclidean Distance" function.

     

    I would love to have an algorithm that will accept categorical/class/nominal data together with numerical data as inputs.

     

    Edit: The process executes in spite of the warning, but the results are pretty much the same as K-Means: the centroids, cluster separation, and number of records in each cluster are very similar.

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can easily transform the polynominal attributes into numerical attributes using "Nominal to Numeric" (I recommend the dummy coding conversion for this purpose).  Independent of that, you'll want to normalize all your data so differences in attribute scales do not distort your distance metric.  Once you do, the clustering should be more reliable.  Try that and see whether it makes any difference; a rough sketch of the same steps is shown below.
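    Outside RapidMiner, the same preprocessing looks roughly like this in Python (a sketch only; the column names and data are made up for illustration, and pandas/scikit-learn stand in for the "Nominal to Numeric", "Normalize", and clustering operators):

    ```python
    # Sketch: dummy-code nominal attributes and range-normalize everything to
    # [0, 1] before clustering. Column names and values are made up.
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.cluster import KMeans

    df = pd.DataFrame({
        "redemption_bucket": ["High", "Low", "Mid", "Low", "High", "Mid"],  # nominal
        "redemption_count":  [45, 5, 25, 12, 60, 30],                       # integer
        "spend_amount":      [120.0, 15.5, 60.0, 22.3, 180.0, 75.0],        # real
    })

    # "Nominal to Numeric" with dummy coding: one 0/1 column per class.
    encoded = pd.get_dummies(df, columns=["redemption_bucket"])

    # "Normalize" with range transformation: every attribute scaled to [0, 1].
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(encoded),
                          columns=encoded.columns)

    # Cluster on the transformed data (k = 2 is just an example here).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
    print(scaled.assign(cluster=labels))
    ```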

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • batstache611 Member Posts: 45 Maven

    Thank you, Brian, but now I cannot figure out how to interpret the cluster composition results. For example:

     

    attribute -> Coupon Redemption Bucket

    classes -> High (40 or more), Mid (20-40), Low (0-20)

     

    After dummy coding, each class becomes an attribute of its own with binary flags (0, 1). I want to be able to say the following about my clusters -

    Cluster 1 mostly consists of users in the high-frequency coupon redemption bin vs. cluster 2, which is composed of low-frequency coupon redemption members, and so on with the rest of the variables. With the given result set, I'm not sure how to get to that point.

     

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you look at the attribute values in the centroid table (part of the output when you connect the cluster output port) then you should be able to tell which attributes are associated with each cluster.  If you normalized your data using a method that puts everything into the scale of 0-1 (like range transformation), then for 0/1 dummy values, the value shown will basically correspond to the % of records in that cluster that have that particular category.  You can then also sort the table to see the highest values for each cluster or even compute differences to find the attributes that are most different between clusters.
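    Concretely, that reading works out like this (a self-contained Python sketch with made-up cluster labels and dummy columns, not your actual output):

    ```python
    # With 0/1 dummy attributes, the per-cluster mean of each dummy column is the
    # share of that cluster's records in that category -- the same reading as the
    # (range-normalized) centroid table. Data and names are made up.
    import pandas as pd

    labeled = pd.DataFrame({
        "cluster":                [0, 0, 0, 1, 1, 1],
        "redemption_bucket_High": [1, 1, 0, 0, 0, 0],
        "redemption_bucket_Low":  [0, 0, 0, 1, 1, 0],
        "redemption_bucket_Mid":  [0, 0, 1, 0, 0, 1],
    })

    # Per-cluster profile: mean of each dummy column within each cluster.
    profile = labeled.groupby("cluster").mean()
    for c, row in profile.iterrows():
        print(f"cluster {c}: mostly {row.idxmax()} ({row.max():.0%} of its records)")

    # Differences between clusters highlight the most discriminating attributes.
    print((profile.loc[0] - profile.loc[1]).abs().sort_values(ascending=False))
    ```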

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts