How to Validate k-means Clustering?
It seems like a simple question. I have a dataset on which I am performing a k-means cluster analysis of consumers' bankruptcy tendency (k=2). I need to know the best way to validate my model's predictive accuracy. I have wasted about 5 hours trying and failing.
My text states the easiest way is by generating a confusion/classification matrix, but for the life of me, I cannot figure out which setting, operator, or selection does this in RM!
All I get for my results is shown below. This is not enough for me to know how well my model is performing against my testing/validation set. I am using a Cross Validation operator containing my cluster model in the training section, and the Apply Model and Cluster Distance Performance operators in the testing section. All I get is this. Why so little information?
Avg. within centroid distance
Avg. within centroid distance: -6.053 +/- 0.279 (mikro: -6.053)
I have attached my dataset and xml of my process.
Answers
Shredlegend88,
If you want to get a confusion matrix, you need to use a performance operator for a supervised classification problem. This requires a label. If you go purely unsupervised, you cannot define a confusion matrix.
~Martin
Dortmund, Germany
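To illustrate why a label is required: a confusion matrix counts pairs of (true value, predicted value), so it needs both columns for every example. The sketch below is plain Python rather than an RM process, and the values are made up for illustration; they are not taken from the attached dataset.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; both columns are needed, row by row."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("need exactly one prediction per labelled example")
    return Counter(zip(true_labels, predicted_labels))


# Made-up values for illustration only.
true_labels = ["yes", "no", "no", "yes", "no"]
predicted_labels = ["yes", "no", "yes", "yes", "no"]

matrix = confusion_matrix(true_labels, predicted_labels)

# Accuracy is the share of pairs on the diagonal (true == predicted).
accuracy = sum(n for (t, p), n in matrix.items() if t == p) / len(true_labels)

for (t, p), n in sorted(matrix.items()):
    print(f"true={t:<3} predicted={p:<3} count={n}")
print("accuracy:", accuracy)
```

Without the true column there is nothing to compare the cluster assignments against, which is why a purely unsupervised run cannot produce this table.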
My dataset has a label; however, when I try to use the performance operator, I get the error "Input ExampleSet does not have predicted label attribute".
What does this mean and how do I fix it? I have tried many approaches: adding dummy variables, changing my label's role/type, etc.
Martin,
Good afternoon. I have successfully gotten a confusion matrix output through trial and error; however, the accuracy is zero percent. Could you take a look at my process and let me know if you can see why? I think it has something to do with roles (label vs. prediction) for my target variable (bankruptcy). I do not understand the criteria for having one or the other.
It seems that the Performance (Classification) operator requires a variable with a role of "prediction". Am I correct in assuming that the variable I am trying to isolate between my two clusters should be set to prediction?
When I change it from Label to Prediction, it performs the analysis, but the accuracy is zero and I don't understand why. All of the variables I selected are sufficiently correlated with my target variable (bankruptcy); however, the confusion matrix states an accuracy of zero. To further confuse things, there is a warning on the performance operator: "Input example set must have special attribute 'label'". My cluster model has "add as label" checked, which may be why it does not throw an error, but I am not sure.
When I select the Performance (Classification) operator, I see "main criterion", and it is currently set to "accuracy". Maybe this is the culprit. I do not see anywhere that these criteria are documented. Can you point me in the right direction? I am new to this tool and I have spent days now trying to figure this out and it is due tonight.
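One possible explanation for the zero accuracy, offered as an assumption since the process XML is not reproduced here: if the cluster attribute holds names such as cluster_0 and cluster_1 while the bankruptcy attribute holds different values, the two columns can never be equal, so every row counts as a miss. The plain-Python sketch below uses made-up data and a hypothetical helper, map_clusters_to_labels, to show the effect and a majority-vote mapping that makes the columns comparable.

```python
from collections import Counter, defaultdict


def accuracy(true_labels, predictions):
    """Share of rows where the two columns hold the same value."""
    hits = sum(t == p for t, p in zip(true_labels, predictions))
    return hits / len(true_labels)


def map_clusters_to_labels(clusters, true_labels):
    """Hypothetical helper: give each cluster its most common true label."""
    votes = defaultdict(Counter)
    for cluster, label in zip(clusters, true_labels):
        votes[cluster][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


# Made-up data; the value names are assumptions, not read from the dataset.
bankruptcy = ["yes", "yes", "no", "no", "no", "yes"]
clusters = ["cluster_1", "cluster_1", "cluster_0",
            "cluster_0", "cluster_1", "cluster_1"]

# Comparing cluster names with label values directly: nothing ever matches.
raw_accuracy = accuracy(bankruptcy, clusters)

# Translate cluster names into label values first, then compare.
mapping = map_clusters_to_labels(clusters, bankruptcy)
mapped_accuracy = accuracy(bankruptcy, [mapping[c] for c in clusters])

print("raw:", raw_accuracy, "mapped:", mapped_accuracy, "mapping:", mapping)
```

Deriving the mapping from the same rows it is scored on gives an optimistic number; inside a cross validation the mapping would be built on the training fold only.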
I replied already in your other thread. What Martin is getting at is that clustering is unsupervised learning. Essentially you create statistical "blobs" (I know @mschmitz will groan at this) of similar data. You can easily see that Cluster 2 tends to have higher rates of bankruptcy based on your normalized data. If you want to predict and calculate a confusion matrix, you will need to create a "label" such as "default" and "no default." Then you would use Cross Validation, measure the classification performance, and generate a confusion matrix.
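As a rough illustration of that supervised workflow outside of RM, here is a minimal plain-Python sketch. The nearest-centroid classifier is only a stand-in chosen for brevity, not the learner anyone in this thread used, and the rows and labels are made up.

```python
from collections import Counter, defaultdict


def train_centroids(rows, labels):
    """Supervised step: one mean vector per known label value."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in groups.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean)."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(row, centroids[lab])))


def cross_validate(rows, labels, folds=3):
    """Pool (true, predicted) counts over all held-out folds."""
    matrix = Counter()
    for fold in range(folds):
        test_idx = set(range(fold, len(rows), folds))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_centroids(train_rows, train_labels)
        for i in sorted(test_idx):
            matrix[(labels[i], predict(model, rows[i]))] += 1
    return matrix


# Made-up, already normalized attribute values with a known label per row.
rows = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
        [0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
labels = ["no default"] * 3 + ["default"] * 3

matrix = cross_validate(rows, labels, folds=3)
accuracy = sum(n for (t, p), n in matrix.items() if t == p) / len(rows)
print(dict(matrix), accuracy)
```

The key difference from clustering is that the model is trained with the label, so its predictions use the same value names as the label and the two columns can be compared row by row.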
With clustering, there are ways to measure the performance, but the results will not generate a confusion matrix.
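The "Avg. within centroid distance" from the original post is one such measure. A minimal plain-Python sketch of the idea follows, assuming plain Euclidean distance and a positive sign; whether RM uses squared distances or reports the value negated is not confirmed here. The points are made up.

```python
import math
from collections import defaultdict


def avg_within_centroid_distance(rows, clusters):
    """Mean distance from each point to the centroid of its own cluster."""
    groups = defaultdict(list)
    for row, cluster in zip(rows, clusters):
        groups[cluster].append(row)
    total = 0.0
    for members in groups.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(row, centroid) for row in members)
    return total / len(rows)


# Made-up points: two on the left, two far to the right.
rows = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]

# Grouping neighbours together gives a small value ...
tight = avg_within_centroid_distance(rows, ["a", "a", "b", "b"])
# ... while mixing left and right points gives a much larger one.
loose = avg_within_centroid_distance(rows, ["a", "b", "a", "b"])

print("tight:", tight, "loose:", loose)
```

A smaller value means more compact clusters. It says nothing about whether the clusters line up with bankruptcy, which is why it cannot stand in for accuracy.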