Clustering and classification
I have a dataset with an attribute Grade (ranging from 1 to 3) and an attribute Explanation.
I converted the numerical Grade to polynominal and set Grade as the label. This is my target.
Then I converted the nominal Explanation attribute to text and split the data into 70% training and 30% testing.
Then I remove the outliers in the training set and use the Process Documents operator. I then tokenize, remove stopwords, stem, and generate n-grams. Then I cross-validate and use k-NN for classification.
When I apply my model to the test dataset, the results are alright, but I would like to try a clustering algorithm instead of a classification algorithm with a target. How do I do this, and what do I need to change in my flow?
Answers
hello @Mirte welcome to the community! I'd recommend posting your XML process here (see https://youtu.be/KkgB5QXWXJ8 and "Read Before Posting" on the right when you reply) and attaching your dataset. This way we can replicate what you're doing and help you better.
Scott
I have uploaded them in the zip.
hello @Mirte - ok thank you for uploading both your xml and your data set. That is very helpful. Some preliminary observations:
1. You have only 15 rows of data. There is literally nothing you are going to glean out of this process with so few rows - either by predictive analytics or by clustering. Nothing. You need a LOT more data. Think of it this way: if you are splitting your data 70/30 (usually a good place to start), you are reducing your training set to 10 rows of data, leaving 5 rows for testing. Now you are taking your 10 rows of data and doing 10-fold cross-validation. That leaves 9 rows for training in each fold, with 1 row held out for validation. So you're using k-NN on nine data points for each model.
2. Going from a predictive model to clustering is a very different idea - it's not just switching from decision trees to SVMs. You're going from supervised learning to unsupervised learning.
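The row counts in point 1 can be checked with a few lines of Python. This is only a sketch: the function name `split_sizes` is made up for illustration, and it assumes the 70% training share is rounded down (15 rows gives 10.5, taken as 10) and that the folds are as equal in size as possible.

```python
def split_sizes(n_rows, train_percent=70, n_folds=10):
    """Return (train rows, test rows, training rows used in each CV fold)."""
    # Integer arithmetic: 15 * 70 // 100 = 10 (rounding down is an assumption)
    n_train = n_rows * train_percent // 100
    n_test = n_rows - n_train

    # Spread the training rows over the folds as evenly as possible
    base, extra = divmod(n_train, n_folds)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]

    # In each fold, one slice is held out and the rest is used for training
    train_rows_per_fold = [n_train - held_out for held_out in fold_sizes]
    return n_train, n_test, train_rows_per_fold


n_train, n_test, train_rows_per_fold = split_sizes(15)
print(n_train, n_test, train_rows_per_fold)
```

With 15 rows, every one of the ten models is fitted on just nine examples.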
Here's an analogy to explain the vast difference between supervised and unsupervised learning:
Say I'm in New York City. Now I want to drive to Boston. I can take any number of routes to go there but there is probably one optimal route based on whatever criteria I set. I get lots of people to drive their cars between NYC and Boston and then I look at their results; I choose the route that seems to produce the best results. Then I test this result on new drives between NYC and Boston and see how I do. This is supervised learning. You use models like decision trees, k-NN, SVMs, etc. inside cross-validation and so forth.
Now say I'm in New York City and I want to look at routes that everyone takes out of the city, irrespective of destination. I get lots of people to drive their cars from NYC to SOMEWHERE and then look at their results. I cannot choose a route that produces the best results because that's not even a relevant question - the "best" result depends on where they are going. But what I can do is group my drivers based on similarities: destination, speed, highway vs local routes, etc. So I create "groups" of drivers based on these features, e.g. "fast highway drivers between NYC and Boston", "slow highway drivers between NYC and New England", etc. This is called "clustering" and you can use k-means or similar algorithms to do this. Or you can look at your drivers and say "OK, if I have a fast highway NYC-Boston driver, what other combinations would she be most likely to do? Maybe fast highway NYC-Washington? Maybe fast local NYC-Boston? Maybe fast local NYC-New England to go skiing?" This is called "association mining" and you can use algorithms such as FP-Growth and so on.
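To make the clustering half of the analogy concrete, here is a minimal k-means sketch in plain Python. Everything in it is an assumption for illustration: the driver data is invented, both features are taken to be already scaled to the 0-1 range, and the first k points are used as starting centroids, which is a simplification. Note that no label or target column appears anywhere; the groups come only from the features.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain k-means. The first k points are the initial centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical drivers: (speed, share of trip on highways), both scaled to 0-1
drivers = [
    (0.90, 0.90), (0.30, 0.20), (0.85, 0.80),
    (0.25, 0.10), (0.95, 0.95), (0.35, 0.30),
]

assignment, centroids = kmeans(drivers, k=2)
print(assignment)
print(centroids)
```

The cluster numbers themselves mean nothing; it is up to you to look at each group and decide whether "fast highway drivers" is a fair description of it.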
Does this make sense? Hmm. This looks like a blog post to me!
Scott
Hi,
It does make sense but this is not my actual dataset.
I just wanted to show my flow and made a short dataset for it. I have a big one in real life.
I know that I am going from supervised to unsupervised with no target.
Just practising with different kinds of things.
I am not sure how to use the FP-Growth in my flow.
oh ok good. I was worried there.
Re: FP-Growth. That's a completely different animal. Have you gone through the Market Basket Analysis tutorial?
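To give a rough feel for what market basket analysis computes, here is a small sketch in plain Python. It is not FP-Growth: it counts itemsets by brute force, which is only workable on tiny data, whereas FP-Growth is designed to find frequent itemsets without enumerating every combination. The trips, the function name `frequent_itemsets`, and the 0.5 support threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Brute-force counting for illustration only, not FP-Growth.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # sorted so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical trips, each described by a set of items
trips = [
    {"fast", "highway", "NYC-Boston"},
    {"fast", "highway", "NYC-Washington"},
    {"fast", "local", "NYC-Boston"},
    {"slow", "highway", "NYC-Boston"},
]

frequent = frequent_itemsets(trips, min_support=0.5)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

Each row here is a "basket" of items rather than a row of numeric features, which is why this does not drop into a classification or clustering flow as is.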