Number of Clusters for Support Vector Clustering (SVC)
Muhammed_Fatih_
Member Posts: 93 Maven
Dear community,
I applied the SVC approach to high-dimensional data with the default settings (kernel type: radial) and got only a single cluster as the result. This surprised me a lot.
How can I set the number of clusters for SVC? In this connection, is there a way to evaluate and validate the number of clusters found by SVC with a performance operator within RapidMiner?
Thanks in advance for your answers!
Best regards!
Answers
Hello
Please post a screenshot of the cluster result. Did you try Auto Model for that?
Thank you
Sara
As far as I know, Auto Model does not provide SVC.
I applied the SVC operator to my high-dimensional database with the following parameter settings: min pts = 10, gamma = 0.005 and p = 0.01. I got the attached cluster result:
So which parameter combination is needed, or rather, which would you propose for high-dimensional data? I think that is the central question here. Or what do you think?
I thank you in advance for your feedback!
Best regards!
Hello
From my understanding you now have 2 clusters, which shows that your data contains very similar parts. From your first post: if you get 1 cluster, the examples are all very similar to each other, but if you get 2 clusters as in your screenshot, RapidMiner can divide your data into 2 parts. I think 2 clusters are better than 1, and with 2 clusters you can also compare the clusters with each other.
In the end it depends on your task and your data.
I hope this helps
Sara
If you need to specify the number of clusters in advance, you should try k-means.
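For illustration, here is a minimal k-Means sketch outside RapidMiner (scikit-learn and the placeholder matrix X are my assumptions, not part of your process), just to show that the cluster count is an explicit input there:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric data matrix (rows = examples, columns = attributes).
X = np.random.rand(500, 8)

# Unlike SVC, k-Means requires the number of clusters as an explicit input.
kmeans = KMeans(n_clusters=7, random_state=0, n_init=10).fit(X)

print(kmeans.labels_[:10])            # cluster id per example
print(kmeans.cluster_centers_.shape)  # (7, 8): one centroid per cluster
```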
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
hello @Telcontar120
thank you for your interesting feedback! Yes, it is correct that SVC detects two cluster groups with the default operator parameters.
The statement of @Telcontar120 is the one I am especially interested in:
According to which criteria should these parameter settings be changed? Is it the amount of input data considered for the clustering process? Which parameters should be changed, and to what extent should they be modified? So the question is really about what the parameters do in detail.
I hope this clarification helped to underline the focus of my question. I thank you in advance for your answers!
Best regards & Stay healthy!
Hello
It depends on your data. If the examples are very similar to each other, it is difficult to separate them into different clusters. In any case, I think you should find a central point for each cluster in your data; in this way you will understand your data and your clusters better. First try to visualize your data, then you will see everything more clearly. You can also plot a curve over the number of clusters and use the point where the curve bends to decide on the number of clusters (see the elbow sketch below). I recommend that you first cluster your data with Auto Model using k-Means or c-Means and then choose the best number of clusters. (I want you to see your data very clearly first and then decide, so the first step is visualization.)
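As a concrete version of the curve idea, here is a hedged elbow-method sketch (scikit-learn and the placeholder matrix X are assumptions for illustration): plot the k-Means inertia for increasing k and look for the bend.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(500, 8)  # placeholder for your (preprocessed) data

# Elbow method: total within-cluster sum of squares (inertia) for k = 1..10.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.show()
```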
For more information:
This operator is an implementation of Support Vector Clustering based on Ben-Hur et al (2001). In this Support Vector Clustering (SVC) algorithm data points are mapped from data space to a high dimensional feature space using a Gaussian kernel. In feature space the smallest sphere that encloses the image of the data is searched. This sphere is mapped back to data space, where it forms a set of contours which enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by each separate contour are associated with the same cluster. As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters. Since the contours can be interpreted as delineating the support of the underlying probability distribution, this algorithm can be viewed as one identifying valleys in this probability distribution.
https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/support_vector_clustering.html
http://www.scholarpedia.org/article/Support_vector_clustering
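To make the description above more tangible, here is a rough toy sketch of the two SVC stages in Python (an illustration only, not the RapidMiner operator): the enclosing sphere is approximated with a one-class SVM whose nu parameter plays a role similar to the outlier fraction p, and clusters are read off as connected components of points whose connecting line segments stay inside the learned boundary.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def svc_labels(X, gamma=0.5, nu=0.1, n_segment_points=10):
    """Toy Support Vector Clustering: SVDD-style boundary + graph labeling."""
    # Stage 1: learn a closed boundary around the data in feature space
    # (a one-class SVM with RBF kernel approximates the enclosing sphere).
    boundary = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)

    # Stage 2: two points share a cluster if every sampled point on the
    # straight line between them lies inside the boundary (decision >= 0).
    n = len(X)
    ts = np.linspace(0.0, 1.0, n_segment_points)
    adjacency = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            segment = np.outer(1 - ts, X[i]) + np.outer(ts, X[j])
            if np.all(boundary.decision_function(segment) >= 0):
                adjacency[i, j] = adjacency[j, i] = True

    # Connected components of this graph are the clusters; outliers end up
    # as singletons. A narrower kernel (larger gamma) tends to split the
    # data into more contours, a wider kernel merges them.
    n_clusters, labels = connected_components(csr_matrix(adjacency),
                                               directed=False)
    return n_clusters, labels
```

Note that the pairwise segment check is quadratic in the number of examples, so a sketch like this is only feasible on small samples.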
Kind regards
Sara
thank you for your feedback!
I have already evaluated the number of clusters using the k-Means clustering approach. I agree that this should be the first step before investigating other clustering techniques.
On that basis I wanted to subsequently apply SVC in order to analyze how many clusters SVC would detect. As I mentioned above, SVC detected two clusters (one of them very small) with the default parameter setting, whereas k-Means detected seven cluster groups. This discrepancy confused me a bit.
Hence my question whether this could be a matter of parameter optimization, given that I am working with a high-dimensional database. Unfortunately, the paper by Ben-Hur et al. (2001) does not evaluate varying parameter settings, so it is not clear which parameter setting would be appropriate for my data.
Which setting would you choose for a database with 70,000 examples (rows) and 8,000 attributes (columns)?
Best regards!
Remove correlated attributes and use PCA to understand which attributes explain the variance of your data. You may end up working with less than 30% of the initial attributes.
If you want to "play" with the parameters and understand whether changing them affects the number of clusters returned, use the Optimize Parameters operator and define ranges for the parameters. This way you can test a wide range of configurations and see whether they have any impact on your data.
If you had a label on your data (not used for clustering), you could then use the Weight of Evidence operator to transform the values of some of your numerical attributes so that the separation increases.
Don't forget to apply Normalization to your numerical data, since outliers affect clustering methods that rely on finding centroids.
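Putting the normalization, correlation-filtering, and PCA suggestions together, a rough sketch of such a preprocessing pipeline (outside RapidMiner; the 0.95 thresholds and the pandas/scikit-learn workflow are illustrative assumptions, not a prescription) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.DataFrame(np.random.rand(1000, 50))  # placeholder for your example set

# Normalize first: distance-based clustering is sensitive to attribute scales.
X = StandardScaler().fit_transform(df)

# Drop one attribute of every pair with |correlation| > 0.95 (arbitrary cutoff).
corr = np.corrcoef(X, rowvar=False)
upper = np.triu(np.abs(corr), k=1)
keep = [i for i in range(X.shape[1]) if not np.any(upper[:, i] > 0.95)]
X = X[:, keep]

# PCA: keep enough components to explain 95% of the remaining variance.
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```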
Hope this information is useful.
very important and useful insights, which I have already partly implemented. I applied PCA to the raw data set and derived the data described above.
Your recommendation regarding parameter optimization is a very good one. The question remains which parameters should be optimized, if we leave aside the challenge of the running time. As @sara20 mentioned:
Which SVC parameters would you optimize, and with which iteration steps?
Best regards!
But if you read through the help text of the SVC operator, you will find that two parameters that are highly significant in determining the number of clusters are p, the proportion of outliers allowed, and r, the target radius of the clusters. Their default settings may not be giving you the optimal number of clusters.
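To see how such parameters drive the cluster count, a small sweep could be run, loosely mirroring what Optimize Parameters (Grid) does inside RapidMiner. This sketch reuses the toy svc_labels function from the earlier post, with nu standing in for the outlier fraction p and gamma for the kernel width (the toy version has no direct analog of r):

```python
import itertools
import numpy as np

X = np.random.rand(300, 8)  # small placeholder sample, not the full data set

# Count how many clusters the toy SVC finds across a small parameter grid.
for gamma, nu in itertools.product([0.1, 0.5, 1.0, 5.0], [0.01, 0.05, 0.1]):
    n_clusters, _ = svc_labels(X, gamma=gamma, nu=nu)
    print(f"gamma={gamma:<4} nu={nu:<5} -> {n_clusters} clusters")
```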
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@Muhammed_Fatih_,
I agree with everyone: the number of clusters depends on your data.
I hope this helps
Sara
thank you for your feedback. Optimizing the mentioned parameters seems to be an appropriate way of determining them. In this connection, is there an evaluation measure that fits SVC? As far as I know, none is implemented in RapidMiner. Can you confirm this?
Best regards!
are the operators "Cluster Distance Performance" and "Cluster Density Performance" applicable to SVC?
For example, the documentation states: "This operator is used for performance evaluation of centroid based clustering methods." However, SVC is not a centroid-based clustering approach, and a similar concern applies to the second operator, which targets density-based clusters.
Do these two performance operators nevertheless fit SVC?
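One algorithm-agnostic alternative I could imagine (an assumption on my side, not a built-in SVC performance operator) would be a distance-based index such as the silhouette coefficient or the Davies-Bouldin index, which only needs the examples and their cluster labels, for example computed outside RapidMiner:

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.random.rand(300, 8)                   # placeholder data
labels = np.array([0] * 150 + [1] * 150)     # placeholder SVC cluster labels

# Both indices work on any labeling, centroid-based or not
# (they require at least two distinct cluster labels).
print("silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
```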
Best regards