How to process categorical data with an unsupervised algorithm for anomaly detection?
I've run into a problem in anomaly detection. Many unsupervised methods rely on a distance measured between instances, but my dataset contains categorical features. I see three options. First, I could remove the categorical features, but I think they carry useful information. Second, I could convert them to numeric values with sklearn's LabelEncoder, but I think the resulting integer codes impose an arbitrary order, so distances between them are not meaningful. Third, I could use sklearn's OneHotEncoder, but that increases the number of feature dimensions, which I think will affect clustering.
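The concern about the second and third options can be illustrated with a small sketch. The three color values are made-up toy data, not taken from the actual dataset. It compares pairwise distances after label encoding and after one-hot encoding.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical feature (hypothetical values, for illustration only).
colors = np.array(["red", "green", "blue"])

# LabelEncoder sorts the classes, so blue=0, green=1, red=2.
codes = LabelEncoder().fit_transform(colors)
label_dist = np.abs(codes[:, None] - codes[None, :])
# red vs. blue is 2 apart, but red vs. green is only 1 apart:
# the encoding invents an ordering that the categories do not have.

# One-hot encoding gives each category its own column.
onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()
onehot_dist = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=2)
# Every pair of distinct categories is the same distance apart (sqrt(2)).

print(label_dist)
print(onehot_dist.round(3))
```

With label encoding the distance depends on the alphabetical position of the category name, while with one-hot encoding all distinct categories are equally far apart, at the cost of one extra column per category.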
Answers
The general preference is to one-hot encode. Yes, this increases the number of feature dimensions, but you can apply PCA to the encoded features to reduce them again. If that does not work well, you can try k-modes in Python, which clusters categorical features directly. Its companion algorithm k-prototypes, described in the same paper, handles a mix of categorical and numeric features.
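A minimal sketch of the one-hot plus PCA route, assuming scikit-learn and NumPy are available. The toy data, the choice of 4 components, and the use of LocalOutlierFactor as the detector are all assumptions made for illustration; any distance-based detector could be substituted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Toy data (hypothetical): two categorical columns and two numeric columns.
colors = rng.choice(["red", "green", "blue"], size=200)
shapes = rng.choice(["circle", "square"], size=200)
X_cat = np.column_stack([colors, shapes])
X_num = rng.normal(size=(200, 2))

# One-hot encode the categorical columns and scale the numeric ones
# so that neither group dominates the distance computation.
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(X_cat).toarray()
scaled = StandardScaler().fit_transform(X_num)
X = np.hstack([onehot, scaled])

# Reduce the widened feature space with PCA (4 components is an arbitrary pick).
X_reduced = PCA(n_components=4, random_state=0).fit_transform(X)

# Score the reduced data with a neighbor-based detector:
# fit_predict returns -1 for flagged outliers and 1 for inliers.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_reduced)
print(X.shape, X_reduced.shape, int((labels == -1).sum()))
```

On real data the number of components is usually chosen by looking at the explained variance ratio rather than fixed in advance.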
K-modes: http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf
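To show the idea behind k-modes, here is a simplified NumPy sketch rather than a reference implementation. It uses the simple-matching dissimilarity (the number of attributes that differ) and replaces cluster means with per-column modes. The function names, the toy rows, and the use of "distance to the nearest mode" as an anomaly score are assumptions for illustration. The third-party `kmodes` package provides `KModes` and `KPrototypes` classes for real use.

```python
import numpy as np

def column_modes(rows):
    """Return the most frequent value in each column of a 2-D category array."""
    modes = []
    for col in rows.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)

def mismatch_counts(X, modes):
    """Simple-matching dissimilarity, shape (n_rows, n_modes)."""
    return (X[:, None, :] != modes[None, :, :]).sum(axis=2)

def k_modes(X, k, n_iter=20, seed=0):
    """Cluster categorical rows; returns (labels, modes)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random.
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = mismatch_counts(X, modes).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                modes[j] = column_modes(members)
    return labels, modes

# Toy data (hypothetical): two groups plus one row unlike all the others.
X = np.array([
    ["red", "circle", "small"],
    ["red", "circle", "small"],
    ["red", "circle", "large"],
    ["blue", "square", "large"],
    ["blue", "square", "large"],
    ["blue", "square", "small"],
    ["green", "star", "medium"],
])

labels, modes = k_modes(X, k=2)
# Assumed anomaly score: mismatches against the closest cluster mode.
scores = mismatch_counts(X, modes).min(axis=1)
print(labels, scores)
```

The result depends on the random starting modes, so in practice the algorithm is usually run with several seeds and the best clustering is kept.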
Thanks
Varun
https://www.varunmandalapu.com/