
HOW TO Validate k-means Clustering?

shredlegend88 Member Posts: 10 Contributor II
edited November 2018 in Help

It seems like a simple question. I have a dataset on which I am performing a k-means cluster analysis (k=2) of consumers' bankruptcy tendency. I need to know the best way to validate my model's predictive accuracy. I have wasted about 5 hours trying and failing.

 

My text states the easiest way is to generate a confusion/classification matrix, but for the life of me, I cannot figure out which setting, operator, or selection does this in RM!

All I get for my results is shown below. This is not good enough for me to know how well my model is performing against my testing/validation set. I am using a Cross Validation operator with my cluster model in the training section, and the Apply Model and Cluster Distance Performance operators in the testing section. All I get is this. Why so little information?

Avg. within centroid distance

Avg. within centroid distance: -6.053 +/- 0.279 (mikro: -6.053)

I have attached my dataset and xml of my process.


Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Shredlegend88,

     

    If you want to get a confusion matrix, you need to use a performance operator for a supervised classification problem. This requires a label. If you go purely unsupervised, you cannot define a confusion matrix.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • shredlegend88 Member Posts: 10 Contributor II

    My dataset has a label. However, when I try to use the performance operator, I get the error "Input ExampleSet does not have predicted label attribute".

     

    What does this mean and how do I fix it? I have tried many approaches: adding dummy variables, changing my label's role/type, etc.

  • shredlegend88 Member Posts: 10 Contributor II

    Martin,

    Good afternoon. I have successfully gotten a confusion matrix output through trial and error; however, the accuracy is zero percent. Could you take a look at my process and let me know if you can see why? I think it has something to do with roles (label vs. prediction) for my target variable (bankruptcy). I do not understand the criteria for having one or the other.

     

    It seems that the Performance (Classification) operator requires a variable with a role of "prediction".  Am I correct in assuming that the variable I am trying to isolate between my two clusters should be set to prediction?  

     

    When I change it from Label to Prediction, it performs the analysis, but the accuracy is zero and I don't understand why. All of the variables I selected are sufficiently correlated with my target variable (bankruptcy), yet the confusion matrix states an accuracy of zero. To further confuse things, there is a warning on the performance operator: "Input example set must have special attribute 'label'". My cluster model has "add as label" checked, which is maybe why it does not error, but I am not sure.

     

    When I select the Performance (Classification) operator, I see "main criterion", and it is currently set to "accuracy". Maybe this is the culprit. I do not see anywhere that these criteria are documented. Can you point me in the right direction? I am new to this tool, I have spent days now trying to figure this out, and it is due tonight.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I replied already in your other thread. What Martin is getting at is that clustering is unsupervised learning. Essentially you create statistical "blobs" (I know @mschmitz will groan at this) of similar data. You can easily see that Cluster 2 tends to have higher rates of bankruptcy based on your normalized data. If you want to predict and calculate a confusion matrix, you will need to create a "label" such as "default" and "no default." Then you would use Cross Validation, measure the classification performance, and generate a confusion matrix.

     

    With Clustering, there are ways to measure the performance but the results will not generate a confusion matrix.
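
For readers following along outside RapidMiner, here is a minimal Python sketch of the two kinds of evaluation discussed in this thread. It is not RapidMiner's implementation, and the function names and toy data are hypothetical. `avg_within_centroid_distance` is an internal measure that needs no label; squared Euclidean distance is an assumption here, and the sign convention of the RapidMiner output quoted in the question is not reproduced. `majority_mapping` covers the case where a label does exist: cluster IDs such as `cluster_0` are arbitrary names, so comparing them directly with label values scores zero, which is one plausible cause of the 0% accuracy described above. Mapping each cluster to its most frequent label first gives a meaningful comparison.

```python
from collections import Counter


def avg_within_centroid_distance(points, clusters):
    """Average squared Euclidean distance of each point to its cluster centroid."""
    groups = {}
    for point, cluster in zip(points, clusters):
        groups.setdefault(cluster, []).append(point)
    total = 0.0
    for members in groups.values():
        # Centroid = per-coordinate mean of the cluster's members.
        center = [sum(coord) / len(members) for coord in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total / len(points)


def majority_mapping(labels, clusters):
    """Map each cluster ID to the label that occurs most often inside it."""
    per_cluster = {}
    for cluster, label in zip(clusters, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}


def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)


# Hypothetical toy data: four 2-D points in two tight clusters.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
point_clusters = [0, 0, 1, 1]
within = avg_within_centroid_distance(points, point_clusters)

# Hypothetical toy data: a known bankruptcy label next to cluster IDs.
labels = ["yes", "yes", "no", "no", "no", "yes"]
clusters = ["cluster_1", "cluster_1", "cluster_0",
            "cluster_0", "cluster_1", "cluster_1"]

# Cluster names never equal label values, so this comparison scores zero.
naive_accuracy = accuracy(labels, clusters)

# Translate cluster IDs into label values before comparing.
mapping = majority_mapping(labels, clusters)
mapped = [mapping[c] for c in clusters]
mapped_accuracy = accuracy(labels, mapped)
confusion = Counter(zip(labels, mapped))  # keys are (true label, mapped label)

print(within, naive_accuracy, mapping, mapped_accuracy)
```

Note that the mapped accuracy is measured on the same rows used to build the mapping, so it describes how well the clusters line up with the label rather than predictive accuracy on new data.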
