Number of Clusters for Support Vector Clustering (SVC)

Muhammed_Fatih_ · June 2020

Dear community,

I applied the SVC approach to high-dimensional data with the default settings (kernel type: radial) and got only a single cluster as a result. This surprised me a lot.

How can I set the number of clusters for SVC? Related to this, is there a way to evaluate and validate the number of clusters found by SVC with a performance operator within RapidMiner?

Thanks in advance for your answers!

Best regards!

Muhammed_Fatih_ · July 2020

Is there anybody here who can help with the issue described above?

Best regards!

sara20 · July 2020

@Muhammed_Fatih_,

Hello

Please post a screenshot of the cluster result. Did you try Auto Model for that?

Thank you
Sara

Muhammed_Fatih_ · July 2020

Hello @sara20,

As far as I know, Auto Model does not provide SVC.

I applied the SVC operator to my high-dimensional data set with the following parameter settings: minpts=10, gamma=0.005 and p=0.01. I got the attached cluster result:

So which parameter combination is needed, or rather which would you propose, for high-dimensional data? I think this is the fundamental question here. Or what do you think?

I thank you in advance for your feedback!

Best regards!

sara20 · July 2020

@Muhammed_Fatih_

Hello

From my understanding you have 2 clusters, which shows that parts of your data are very similar. In your first post you had 1 cluster, meaning the points are all very similar to each other, but with 2 clusters, as in your screenshot, RM can divide your data into 2 parts. I think 2 clusters are better than 1. Also, if you need to compare clusters, that is possible with 2 clusters.
Finally, it depends on your work and your data.

I hope this helps
Sara

Telcontar120 · July 2020

There is no way to explicitly set the number of clusters in advance with SVC. The point is to allow the algorithm to detect the correct number of clusters based on the underlying data. You can play with the other ML parameters to see whether that changes the number of clusters found (it usually does). As Sara noted, your results show two clusters now (Java counting starts at zero, so you have cluster 0 and cluster 1).
If you need to specify the number of clusters in advance, you should try k-means.

Muhammed_Fatih_ · July 2020

Hello @sara20,
hello @Telcontar120

thank you for your interesting feedback! Yes, it is correct that SVC detects two clusters based on the default operator parameters.

The statement of @Telcontar120 is especially the one I am interested in:

"You can play with the other ML parameters to see whether that changes the number of clusters found (it usually does)."

According to which criteria should these parameter settings be changed? Is it the number of input examples considered in the clustering process? Which parameters should be changed, and to what extent should they be modified? So the question is more about what the parameters do in detail.

I hope this clarification helped to underline the focus of my question. I thank you in advance for your answers!

Best regards & Stay healthy!

sara20 · July 2020

@Muhammed_Fatih_,

Hello

It depends on your data. If the points are very similar to each other, it is very difficult to separate them into different clusters. Overall, I think you should find a central point for each cluster in your data; that way you will understand your data and your clusters better. Try to visualize your data first, then you will see everything. Alternatively, you can plot a curve from your data, and the points where the curve changes can indicate the number of clusters. I recommend that you first cluster your data with Auto Model using k-Means or C-Means and then choose the best number of clusters. (I want you to see your data very clearly first and then decide, so the first step is visualization.)

For more information:

This operator is an implementation of Support Vector Clustering based on Ben-Hur et al (2001). In this Support Vector Clustering (SVC) algorithm data points are mapped from data space to a high dimensional feature space using a Gaussian kernel. In feature space the smallest sphere that encloses the image of the data is searched. This sphere is mapped back to data space, where it forms a set of contours which enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by each separate contour are associated with the same cluster. As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters. Since the contours can be interpreted as delineating the support of the underlying probability distribution, this algorithm can be viewed as one identifying valleys in this probability distribution.
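
The steps in that description can be tried outside RapidMiner with a minimal Python sketch. This is not the operator's implementation. It assumes scikit-learn's OneClassSVM as a stand-in for the enclosing-sphere step (for a Gaussian kernel the two formulations are commonly treated as equivalent), and it decides whether two points share a contour by checking that sample points on the straight line between them stay inside the boundary. The names svc_labels and n_checks and the toy data are hypothetical. Note that scikit-learn's gamma is an inverse width, so a larger gamma means a narrower kernel; whether the RapidMiner gamma parameter follows the same convention is not confirmed here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.svm import OneClassSVM


def svc_labels(X, gamma=1.0, nu=0.1, n_checks=10):
    """Toy support vector clustering; returns one label per row, -1 for outliers.

    Hypothetical helper for illustration. It builds all point pairs at once,
    so memory grows quadratically and it only suits small data sets.
    """
    # Step 1: learn the boundary that encloses most of the data in feature space.
    model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)
    inside = np.flatnonzero(model.decision_function(X) >= 0)
    P = X[inside]
    n, dim = P.shape

    # Step 2: two points are linked if interior samples on the segment
    # between them all lie inside the learned boundary.
    a, b = np.triu_indices(n, k=1)
    ts = np.linspace(0.0, 1.0, n_checks + 2)[1:-1]
    seg = P[a][:, None, :] + ts[None, :, None] * (P[b] - P[a])[:, None, :]
    scores = model.decision_function(seg.reshape(-1, dim))
    linked = (scores >= 0).reshape(len(a), len(ts)).all(axis=1)

    # Step 3: clusters are the connected components of the link graph.
    graph = csr_matrix((np.ones(linked.sum()), (a[linked], b[linked])), shape=(n, n))
    _, component = connected_components(graph, directed=False)

    labels = np.full(len(X), -1)
    labels[inside] = component
    return labels


# Hypothetical toy data: two well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)), rng.normal(10.0, 0.3, size=(30, 2))])
true_blob = np.repeat([0, 1], 30)

labels = svc_labels(X, gamma=1.0, nu=0.1)
print("clusters found:", len([c for c in np.unique(labels) if c != -1]))
```

Calling svc_labels with several gamma values on the same data is a quick way to watch the cluster count react to the kernel width, which is the effect the documentation describes.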

https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/support_vector_clustering.html
http://www.scholarpedia.org/article/Support_vector_clustering

Kind regards
Sara

Muhammed_Fatih_ · July 2020

Hello @sara20,

thank you for your feedback!

I've already evaluated the number of clusters using the k-Means clustering approach. I agree that this should be the first step before investigating other clustering techniques.

With this in mind, I wanted to apply SVC afterwards to analyze how many clusters SVC would detect. As I mentioned above, SVC detected two clusters (one of them very small) with the default parameter settings, whereas k-Means detected 7 clusters. This discrepancy confused me a bit.

Hence my question whether this could be an issue of parameter optimization, given that I am working with a high-dimensional data set. Unfortunately, the paper by Ben-Hur et al. (2001) does not evaluate varying parameter settings. It is therefore not clear which parameter settings would be appropriate for my data.

Which settings would you choose for a data set with 70,000 objects (rows) and 8,000 attributes (columns)?

Best regards!

MarcoBarradas · July 2020

@Muhammed_Fatih_ what type of preprocessing is done on the high-dimensional data set? Are all of those attributes adding value to the clustering? It is my understanding that you could and should reduce the number of attributes used before applying any clustering techniques.
Remove correlated attributes and use PCA to understand which attributes explain the variance of your data. Maybe you'll end up working with less than 30% of the initial attributes.
If you want to "play" with the parameters and understand whether changing them affects the number of clusters returned, use the Optimize Parameters operator and define ranges for the parameters. This way you can test a wide range of configurations and see whether they have any impact on your data.
If you have a label (not used for clustering) in your data, you could use the Weight of Evidence operator to transform the values of some of your numerical attributes so that the separation increases.
Don't forget to apply normalization to your numerical data, since outliers affect clusters because of the way centroids are found for the clusters.
Hope this information is useful.
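
The normalization and PCA advice above can be sketched in a few lines of Python. This is only an illustration with scikit-learn, not a RapidMiner process. The data is a hypothetical stand-in (50 columns driven by 5 hidden factors), and the 95% variance threshold is an assumed choice, not a recommendation from this thread.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 500 rows, 50 columns built from only 5 latent factors.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 50))

# Standardize every column, then keep just enough principal components
# to retain 95% of the variance (the threshold is an assumed choice).
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
kept_variance = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape)
print("share of variance kept:", round(float(kept_variance), 3))
```

The reduced matrix, rather than the original attribute set, would then be passed to the clustering step.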

Muhammed_Fatih_ · July 2020

Hello @MarcoBarradas,

very important and useful insights, some of which I have already implemented. I applied PCA to the raw data set and obtained the data described above.

Your recommendation regarding the Optimize Parameters operator is a very good one. This again raises the question of which parameters should be optimized, leaving aside the running-time challenge. As @sara20 mentioned:

"As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters."

Hence, this could be an option, but how many iteration steps would be appropriate for SVC if the default value of gamma is 1.0? On the other hand, the tutorial process provided by RapidMiner sets gamma to 0.005. According to which criteria?

Which SVC parameters would you optimize, and with which step sizes?

Best regards!

Telcontar120 · July 2020

This is hard to say in the abstract because the clustering is very dependent on your data.
But if you read through the help text of the SVC operator, you will find that two parameters that are highly significant in determining the number of clusters are p, the proportion of outliers allowed, and r, the target radius of the clusters. Their default settings may not be giving you the optimal number of clusters.

sara20 · July 2020

Hi all

@Muhammed_Fatih_,

I agree with everyone: the number of clusters depends on your data.

I hope this helps
Sara

Muhammed_Fatih_ · July 2020

Dear all and @sara20, @Telcontar120 and @MarcoBarradas,

thank you for your feedback. Optimizing the mentioned parameters seems to be an appropriate way of determining the parameters. Related to this, is there an evaluation measure that fits SVC? As far as I know, none is implemented in RapidMiner. Can you confirm this?

Best regards!

Telcontar120 · July 2020

Well, there actually are several performance operators for clusters, such as Cluster Distance Performance and Cluster Density Performance. You might want to check those out. But the problem with unsupervised ML in general is that there is no clear "correct" answer, so the "best" cluster performance is somewhat in the eye of the beholder.

Muhammed_Fatih_ · July 2020

Hi @Telcontar120

are the operators "Cluster Distance Performance" and "Cluster Density Performance" applicable to SVC?

For example, the documentation states the following: "This operator is used for performance evaluation of centroid based clustering methods." SVC is not a centroid-based clustering approach, and the same concern applies to the second operator, which targets density-based clusters.

Do both performance operators fit SVC anyway?

Best regards

Telcontar120 · July 2020

Yes, you are correct. Sorry, I thought you were asking about clustering performance operators in RapidMiner in general. I am not aware of a performance operator for SVC other than the generic Cluster Count operator, which is not really all that useful.

Muhammed_Fatih_ · August 2020

Is there anybody else who can recommend a performance evaluation for SVC?
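
One measure that needs only the points and their cluster labels, and no centroids, is the silhouette coefficient. The sketch below is a hedged illustration in Python with scikit-learn; it is not a RapidMiner operator, and nothing in this thread confirms that it suits SVC results on this particular data. The toy data and both label vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Hypothetical toy data: two compact, well-separated groups of 40 points each.
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)), rng.normal(5.0, 0.3, size=(40, 2))])

# A labeling that matches the groups and a random labeling for comparison.
labels_good = np.repeat([0, 1], 40)
labels_bad = rng.integers(0, 2, size=80)

# The silhouette uses only pairwise distances and labels; values near 1
# mean points sit much closer to their own cluster than to any other.
score_good = silhouette_score(X, labels_good)
score_bad = silhouette_score(X, labels_bad)
print("matching labels:", round(float(score_good), 3))
print("random labels:  ", round(float(score_bad), 3))
```

Two caveats apply. Points that a clustering marks as outliers would have to be dropped before scoring, and the silhouette tends to reward compact, roughly convex clusters, so the elongated or curved clusters that SVC can produce may score lower than they deserve.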
