Column-1's value is based on column-2's value; should we remove one?
The database table I am using for clustering has some columns that are calculated from the values in other columns.
E.g.
col-M   col-N=(5*col-M)   col-R   col-S   col-T
--------------------------------------------------------------
x       5x                a       c       a-c
y       5y                b       d       b-d
In such cases, is it better to remove the redundant columns (apart from the ones that will help to interpret the clustering results)?
Answers
Here is what I am thinking:
Removing "redundant" columns is not that simple.
To refer to your example:
R S T(=R-S)
1 1 0
2 2 0
Using the Euclidean metric, the distance between these two rows restricted to R and S is sqrt(2), while the distance restricted to T is 0. So be careful when you remove columns: if you kept only T and dropped R and S, these two rows would look identical.
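To make the numbers concrete, here is a minimal sketch in plain Python that recomputes the distances for the two rows above. The helper name with_t is made up for illustration; only the standard library is used.

```python
import math

# The two rows from the example: (R, S), with T = R - S derived from them.
row1 = (1.0, 1.0)
row2 = (2.0, 2.0)

def with_t(row):
    """Append the derived column T = R - S to an (R, S) row (hypothetical helper)."""
    r, s = row
    return (r, s, r - s)

dist_rs = math.dist(row1, row2)                    # Euclidean distance on R and S only
dist_t = abs(with_t(row1)[2] - with_t(row2)[2])    # distance on T only
dist_all = math.dist(with_t(row1), with_t(row2))   # distance on R, S and T together

print(dist_rs, dist_t, dist_all)
```

On R and S the distance is sqrt(2) (about 1.414), on T alone it is 0, and with all three columns it is again sqrt(2), because T contributes nothing for these two rows.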
On the other hand: assuming the derived columns are needed, keeping even the redundant ones will at worst increase the absolute distance between two items.
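A small sketch of that effect, using the col-N = 5*col-M column from the question. The two col-M values are invented for illustration, and the Euclidean metric is assumed.

```python
import math

def add_n(m):
    """Return (col-M, col-N) with the derived column col-N = 5 * col-M."""
    return (m, 5 * m)

# Hypothetical col-M values for two rows; any two distinct values behave the same way.
m1, m2 = 1.0, 3.0

d_m = abs(m1 - m2)                        # distance using col-M only
d_mn = math.dist(add_n(m1), add_n(m2))    # distance using col-M and col-N
ratio = d_mn / d_m                        # sqrt(1 + 5**2) = sqrt(26) for every pair

print(d_m, d_mn, ratio)
```

The distance never shrinks when the derived column is added. It is stretched by a constant factor of sqrt(26), roughly 5.1, along col-M. If other columns are used as well, this acts like an implicit extra weight on col-M, which may or may not be what you want.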
regards,
Steffen
PS: I cannot resist remarking that the results (compared with using only the original columns) may change if your additional columns were computed with awkward (e.g. nonlinear) transformations that your metric cannot cope with. But I guess you are aware of that.
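As a rough illustration of the PS, the sketch below adds a made-up nonlinear column Q = M*M (not from the original table) and checks which row is nearest to a given one, with and without it. The values and the helper name nearest are assumptions chosen only to show the effect.

```python
import math

# Hypothetical col-M values for three rows.
values = [-2.0, 1.0, 2.5]
query = 1.0

def nearest(query, candidates, embed):
    """Return the candidate (other than query) closest to query after embedding."""
    others = [c for c in candidates if c != query]
    return min(others, key=lambda c: math.dist(embed(query), embed(c)))

# Using col-M only.
nearest_m = nearest(query, values, lambda m: (m,))

# Using col-M plus the derived nonlinear column Q = M * M.
nearest_mq = nearest(query, values, lambda m: (m, m * m))

print(nearest_m, nearest_mq)
```

With col-M alone, the row with M = 2.5 is nearest to M = 1. Once Q is added, the row with M = -2 becomes nearest, so a clustering based on these distances could group the rows differently.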