The role of dimensionality reduction with regard to Clustering approaches
Muhammed_Fatih_
Member Posts: 93 Maven
in Help
Hello Community,
I plan to evaluate several Clustering techniques on a TF-IDF bag-of-words representation on which I have already run a Feature Selection to reduce the number of dimensions of my vector space. I've read that, when Clustering algorithms are applied afterwards, Feature Extraction/Transformation approaches give better dimensionality-reduction results than Feature Selection approaches. First of all, how do you view this claim from a theoretical standpoint?
Secondly, as explained, I've already executed Feature Selection. Would it be correct to additionally run Feature Extraction on the dimensions that remain after Feature Selection? Or should Feature Extraction for efficient Clustering be applied to the initial raw dataset?
Thank you all for your participation and your answers!
Best regards!
Best Answer
MartinLiebig
Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
another approach I really like is the combination of PCA and K-Means. This makes a lot of sense in many scenarios because both algorithms rest on similar assumptions: PCA keeps the directions of greatest variance, and K-Means minimizes squared Euclidean distances to the cluster centers, which is itself a within-cluster variance. Afterwards you can use a technique like this: https://towardsdatascience.com/understanding-clustering-cf0117148ef4 to understand what is going on.
Cheers,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
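For readers working outside RapidMiner, the PCA plus K-Means combination described above might look roughly like the sketch below. It assumes scikit-learn is available; the toy corpus, the choice of two components, and the choice of two clusters are all invented for illustration and are not taken from this thread.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration (not from this thread).
docs = [
    "the cat sat on the mat",
    "the cat and the dog are pets",
    "my dog chased the cat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
    "markets rallied and stock prices rose",
]

# TF-IDF bag of words, densified so PCA can center it (fine at toy size).
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Keep the two directions of greatest variance (2 is an arbitrary choice).
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)

# K-Means minimizes squared Euclidean distances in the reduced space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_pca)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("cluster labels:", labels)
```

On a realistically sized sparse TF-IDF matrix, densifying is usually impractical; `TruncatedSVD` is the common substitute there, with the caveat that it does not center the data and so is not strictly PCA.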
Answers
Based on your question, I assume that you are talking about techniques like PCA, ICA, or other transformations related to your data (n-grams etc.). One of the major drawbacks of dimensionality reduction techniques like PCA is the loss of interpretability. If you want to explain or interpret the results, feature selection is the way to go, as it preserves the original features. If your focus is purely on reducing dimensionality, feature extraction is an option; you can use it where interpretation is not highly important.
The two (extraction and selection) seem similar, but they serve different purposes. I am not sure it is always correct to say that feature extraction works better than selection.
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Ingo
Thank you for the literature recommendation!
However, you wrote that one should be careful when combining Feature Selection and Clustering. Do you have alternatives for efficient dimensionality reduction and subsequent Clustering if you want to interpret the Clustering results afterwards, as @varunm1 mentioned? I don't see any other way besides Topic Modeling approaches like LDA.
Thank you in advance for your answer!
Interesting approach. So you cluster on the PCA values and then try to make sense of the detected clusters afterwards by using a Decision Tree, right?
Best regards!