Cluster analysis with the wholesale customers dataset
Hello everyone,
we are a group of marketing students taking a course called "Marketing Analytics", and our task is to perform a cluster analysis, using different clustering methods, on the dataset from here:
https://archive.ics.uci.edu/ml/datasets/wholesale+customers
The exact description is the following:
"The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. Goal: Find Clusters of Customers"
For that, we should try out different clustering methods (our professor told us to try DBSCAN and hierarchical clustering in addition to k-means).
So far we did the following:
Added operator: Read CSV -> loaded the dataset
Added operator: Select Attributes -> filtered out the nominal attributes Channel & Region
Added operator: k-Means
First off, we do not know how to find the optimal "k" in RapidMiner. How can we get to this, and how can we see the intra-cluster distance and thus the "elbow" graph in RapidMiner for this dataset? (I attached a graphic from a presentation I found.)
As we have more than 2 attributes (Milk, Frozen, Fresh, Delicassen, Grocery, etc.), how can we visualize the clusters? What kinds of clusters can we get out of this dataset?
Also, how can we use DBSCAN clustering? If we just connect it to the Select Attributes operator and run it, we only get one cluster...
Our professor also told us to use some loop; is it also necessary to filter out outliers?
Please help, we are struggling a lot with this task. If someone is able to explain this task, he or she can also contact me privately and I would offer something for the effort.
Thanks a lot!!
Best Answer
lionelderkrikor (RapidMiner Certified Analyst)
Hi @mluethke87,
Your project is interesting: it highlights the difficulty of clustering some data.
I started by investigating what is normally the first step of the data science methodology: visual data analysis.
We are in a high-dimensional space, but we can always plot one attribute x against another attribute y in 2D.
For example here Milk vs Grocery :
How many cluster(s) do you see?
NB: we find this particular "distribution" of the data for a lot of combinations of attribute_x vs attribute_y in your dataset.
Visually, it is difficult to answer the question "how many clusters are there?". It is subjective - every human is different -
but we could answer "number of clusters = 1" with:
- 1 for the whole dataset, or
- 1 for the bigger (or smaller) cluster in the bottom-left corner, the rest of the data being unclassifiable noise.
Now let's see the clusters obtained from the k-means model with k = 6 (as a reminder, k = 6 is given by optimizing the Davies-Bouldin index in the process of @Thomas_Ott):
Secondly, let's see the clusters obtained from the X-means model recommended by @Telcontar120 (which concludes k = 4):
In both cases we see that, when we "force" an algorithm (k-means or X-means) to find clusters, these clusters have very different "densities" in the case of your dataset.
But when we use DBSCAN, we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for those points to be considered a cluster, so we effectively define an "estimate of the density of the clusters".
So for the DBSCAN algorithm to find clusters, the clusters must have similar densities; that is why it cannot manage clusters of very different densities, and why, in your case, it ultimately always concludes that the number of clusters = 1.
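The density argument can be illustrated outside RapidMiner. The sketch below (Python with scikit-learn; the synthetic blobs and the parameter values are assumptions for illustration, not the wholesale data) builds one tight and one sparse blob: an epsilon tuned to the dense blob turns the sparse one into noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two blobs with very different densities: a tight one near the
# origin and a widely scattered one far away.
tight = rng.normal(loc=0.0, scale=0.1, size=(100, 2))
sparse = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X = np.vstack([tight, sparse])

# eps tuned to the tight blob: the sparse blob dissolves into noise
# (scikit-learn marks noise points with the label -1).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

No single eps / min points pair fits both densities at once: raising eps enough to capture the sparse blob merges everything into one cluster again.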
Second part: RapidMiner vs Python (sorry, this post is not finished yet...)
First, regarding this story of number of clusters = 1, I decided to compare the results of RapidMiner's DBSCAN with the
results of Python's DBSCAN (sorry @sgenzer if you read this post): in both cases, the conclusion is number of clusters = 1.
But depending on the Epsilon / Min Points settings, Python's DBSCAN concludes that some data are "unlabelled" (i.e. noise), while in the case of RapidMiner, all the data are clustered into the single cluster.
I think Python's DBSCAN conclusion is the logical one. Indeed, as said previously, by the definition of the DBSCAN algorithm we set the epsilon distance and the minimum number of points (MinPts) that must lie within an epsilon radius for those points to be considered a cluster. From my point of view, there are isolated data points in this dataset that should not belong to any cluster and should be considered noise (depending on the Epsilon / Min Points settings). For example, for epsilon = 1 / min points = 5, here are the conclusions of Python's DBSCAN:
NB: in red, the clustered data; in blue, the "unlabelled" data.
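For reference, the "unlabelled" behaviour can be reproduced with scikit-learn's DBSCAN. This is a minimal sketch on synthetic data (the one-blob-plus-outlier setup is an assumption for illustration, not the wholesale dataset): points labelled -1 are the noise points, and filtering them out is what one would expect "remove unlabelled" to do.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
blob = rng.normal(0.0, 0.2, size=(50, 2))   # one dense cluster
outlier = np.array([[8.0, 8.0]])            # one isolated point
X = StandardScaler().fit_transform(np.vstack([blob, outlier]))

# Same settings as in the post: epsilon = 1, min points = 5.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# scikit-learn marks noise with -1; keeping only labelled points
# removes the "unlabelled" data.
clustered = X[labels != -1]
print(sorted(set(labels)))  # → [-1, 0]
```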
I thought I would find this behaviour by checking the remove unlabelled parameter of DBSCAN in RapidMiner, but that is not the case.
So my question is: why does RapidMiner's DBSCAN cluster all the data regardless of the epsilon / min points settings?
In conclusion, I hope I have contributed to the reflection on DBSCAN and your project.
And now the post is actually finished (phew...!)
Best regards,
Lionel
Answers
Hi @mluethke87,
Can you share your process and your dataset(s), please ?
Some response elements :
For the optimal number of clusters "k", there is a theoretical method (though it is not guaranteed to work every time...):
you can use the K-Means model together with the Performance (Cluster Distance Performance) operator - with Davies Bouldin as the
main criterion - inside the Optimize Parameters operator, choosing "k" as the parameter to optimize:
the value of "k" which minimizes the Davies-Bouldin index is the optimal value of k... (in theory, if such a value exists).
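Outside RapidMiner, the same Davies-Bouldin sweep can be sketched in Python with scikit-learn (the three planted blobs below are an assumption for illustration; on the real wholesale data the minimum may be far less clear-cut):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Three well-separated planted clusters as a stand-in dataset.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

# Sweep k, score each clustering, keep the k with the lowest index.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # k minimising the Davies-Bouldin index
print(best_k)  # → 3
```

This is exactly what the Optimize Parameters (Grid) setup does inside RapidMiner: loop over k, evaluate the performance operator, keep the best value.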
Thanks to the experts for correcting me if I'm wrong.
Regards,
Lionel
Hi again @mluethke87,
Sorry, I read your post too fast. OK: your process is in the attached file, and there is a link to your dataset...
Regards,
Lionel
Hey,
I attached the process. It also includes other clustering methods connected to a Multiply operator.
Right now it is still all about k-means - and how to determine the correct number "k" to use for this task.
People share graphics showing the "elbow", but I do not see any explanation showing step by step how this is done in RapidMiner.
You said:
"..associated to the Performance (Cluster Distance Performance) operator - with the Davies Bouldin as
Main criterion - inside the Optimize Parameters operator and choose "k" as parameter to optimize.."
I inserted the Optimize Parameters (Grid) operator, but it does not show any of the functions / steps you explained.
Can you show this visually? I cannot get there on my own.
Thank you
@mluethke87 You can compute the Davies-Bouldin index by using one of the cluster performance operators and use it inside the Optimize Parameters operator, like below.
You should definitely review the optimization video tutorial to get familiar with it; it's a very powerful operator.
Hi @mluethke87,
To provide some response elements to your question "how can we see the intra-cluster distance and thus the 'elbow' graph in RapidMiner for this dataset? (I attached a graphic from a presentation I found)":
I don't know exactly what an elbow graph is; I don't think RapidMiner provides such graphs.
Your first graph shows the within-cluster sum of squares vs k (number of clusters). RapidMiner doesn't calculate the within-cluster sum of squares, but rather the average within centroid distance.
You can obtain a similar curve by plotting the average within centroid distance vs k using the Log operator. In your case, we obtain this curve:
and here is the process:
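For comparison, the same elbow-style curve can be sketched in Python. This is a sketch on synthetic data; inertia_ / n is used here as a rough stand-in for RapidMiner's average within centroid distance (not exactly the same quantity, but it produces the same elbow shape):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

# inertia_ is the sum of squared distances to the nearest centroid;
# dividing by n gives a per-example average, which drops as k grows.
avg_within = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    avg_within[k] = km.inertia_ / len(X)

# The "elbow" is where the curve stops dropping steeply (here k = 3):
# the drop from k=2 to k=3 is large, from k=3 to k=4 it is tiny.
print(avg_within[2] - avg_within[3], avg_within[3] - avg_within[4])
```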
Regards,
Lionel
Hey,
thanks a lot guys for your help already.
So in the graph you showed, it would make sense to use k = 3, because the average within centroid distance in relation to the number of clusters would be "optimal": if you kept adding clusters, the average within centroid distance would not decrease much anymore, correct?
Still, within the subprocess you also put a k-means, which is preconfigured to k = 3 - in this case, does that number only affect the number of runs the loop makes, or does it affect anything at all?
Also, if I delete the Multiply operator and just connect the Loop Parameters operator to Select Attributes, the average within centroid distance graph created in this subprocess is different - why is it affected by this?
See screenshots attached & XML Code
Thank you!
Hi @mluethke87,
just a response element about the optimal "k" value :
- According to the graph, the optimal value of "k" seems to be 3, for the reason you give.
- According to the Davies-Bouldin index, the optimal value of "k" is 6 when running the process provided by @Thomas_Ott.
Maybe you can consider the two cases in parallel in your project.
Regards,
Lionel
okay thank you!
Still, can you guys please tell me why DBSCAN spits out only one cluster (all data clustered into one)? Why did our professor even mention this algorithm if it does not fit our dataset?
Also, is there a way to show the individual correlation values when I compare Milk - Grocery, etc.? So I can see whether some of these categories have any correlation at all?
Thank you!
Hi @mluethke87,
1. Maybe your professor doesn't want to give you the "right answer", but wants you to experiment with the DBSCAN model yourself to see
its behaviour and what its advantages and disadvantages are.
So I propose that you try different combinations of the DBSCAN algorithm's parameters (epsilon / min points) to determine, in each case, how many cluster(s) it concludes.
2. To determine the correlations between your attributes, you can:
- plot attributes against each other to see visually whether there is a correlation between them;
- use the Correlation Matrix operator to see whether there is a linear correlation between two of your attributes. This matrix produces a number in the range [-1, 1] for each pair of attributes, where -1 = perfect negative correlation, 0 = no correlation, and 1 = perfect positive correlation.
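As a cross-check outside RapidMiner, the same Pearson correlations can be sketched with pandas. The column names below reuse the wholesale attribute names, but the values are synthetic illustrations, not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
milk = rng.normal(5000, 1000, size=200)

# Grocery is built to correlate strongly with Milk; Frozen is independent.
df = pd.DataFrame({
    "Milk": milk,
    "Grocery": 2 * milk + rng.normal(0, 1000, size=200),
    "Frozen": rng.normal(3000, 800, size=200),
})

corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]
print(round(corr.loc["Milk", "Grocery"], 2))  # close to 1
print(round(corr.loc["Milk", "Frozen"], 2))   # close to 0
```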
Regards,
Lionel
You can also check your k-means work by using the X-Means operator, which recommends/selects an optimal value of k based on the BIC (similar, but not identical, to the Davies-Bouldin method you are using manually above).
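X-Means itself is not available in scikit-learn, but the underlying idea - score each candidate k with the BIC and keep the best one - can be sketched with Gaussian mixtures (synthetic blobs here; this is an analogue of the BIC-based selection, not X-Means itself):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

# Lower BIC is better: the extra parameters of a larger k must pay
# for themselves in likelihood, which penalises overfitting k.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 9)}
best_k = min(bics, key=bics.get)
print(best_k)
```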
Hey,
thanks again for the reply.
The problem is, I only get one cluster, so it does not seem to work with DBSCAN. I looked everywhere, but could not find a proper explanation for why that is. I certainly played around with the epsilon and the min points.
Can you tell me why this dataset does not get clustered through DBSCAN? It is frustrating, and I am not even sure whether it can work at all.
XML File attached
Thanks!
Hi @mluethke87,
Here is a possible response element:
"DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters." (extract from the DBSCAN article on Wikipedia):
https://en.wikipedia.org/wiki/DBSCAN
Maybe that is the case for your dataset, and that's why DBSCAN has trouble "isolating" some clusters.
Regards,
Lionel
Hi again,
To complete my previous post, here is the associated process:
Best regards,
Lionel