"Cluster/group by attribute ranges and classification density"
I have a transactional data set with 20 attributes, a1 to a20 (numerical, nominal, binominal), and 4000 examples. Attribute a20 is the classification (binominal, with value 1 or 0). I have to group/cluster the data in the following way:
1. 3 to 6 groups/clusters (it can be a fixed number, e.g. 5)
2. groups have to be sorted ranges of numeric attribute a1. Values of a1 are between 100 and 10000. (sounds like discretization operators can be used)
3. main criterion is min/max density of negative classified examples (those with value of a20 equal to 0) in a given range.
Example set has roughly 10 % of those classified as negative. One example of a solution would be:
C1, group where a1 ∈ [100, 1000], density of negative = 30% (if there were 500 examples, 150 would be negative)
C2, group where a1 ∈ (1000, 2000], density of negative = 5% (if there were 100 examples, 5 would be negative)
C3, group where a1 ∈ (2000, 3000], density of negative = x (if there were y examples, x*y would be negative)
C4, group where a1 ∈ (3000, 6000], density of negative = x (if there were y examples, x*y would be negative)
C5, group where a1 ∈ (6000, 10000], density of negative = x (if there were y examples, x*y would be negative)
The goal is to group the examples into a few ranges of a1 such that each group contains either as many or as few negative examples as possible.
Which approach/process could solve this grouping/discretization problem? I have been unsuccessfully trying to cluster it for some time now.
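To make the target concrete, here is a minimal Python sketch (plain Python, not a RapidMiner process) that summarises a candidate set of cut points on a1. The function name `negative_density_by_range` and its arguments are hypothetical, chosen only for illustration. It assumes a20 is coded as 0/1 and treats every range as (low, high], except the first, which also includes its lower edge.

```python
from bisect import bisect_left


def negative_density_by_range(a1, a20, edges):
    """Summarise each range (edges[i], edges[i+1]] of a1.

    The first range also includes its left edge, matching [100, 1000].
    Returns one (low, high, count, negatives, density) tuple per range.
    """
    uppers = edges[1:]
    counts = [0] * len(uppers)
    negatives = [0] * len(uppers)
    for value, label in zip(a1, a20):
        if value < edges[0] or value > edges[-1]:
            continue  # outside the covered span; ignored in this sketch
        i = bisect_left(uppers, value)  # index of first upper edge >= value
        counts[i] += 1
        if label == 0:  # a20 == 0 marks a negative example
            negatives[i] += 1
    return [
        (edges[i], edges[i + 1], counts[i], negatives[i],
         negatives[i] / counts[i] if counts[i] else 0.0)
        for i in range(len(uppers))
    ]


# Toy data, purely illustrative.
for row in negative_density_by_range(
        [100, 500, 1000, 1500, 2000], [0, 1, 0, 1, 1], [100, 1000, 2000]):
    print(row)
```

Any candidate grouping can be compared this way: a good one has densities that are either far above or far below the overall 10 % in every range.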
Answers
Hope you guys had a ball in Dortmund, sorry I wasn't able to attend. Consider yourselves spared! As to your problem I'm sure there are better ways to solve this, but let the following kick off proceedings...
I've looked at this as a regression optimisation problem: you need to minimise the difference between the average value of att20 in each grouping and the best it could be, which is 1 in all cases. So you grind up the averages as your prediction and check the difference, like this... Now you can get very fancy about this, change the type of binning and so on; but the thing to notice is how simple it is to do optimisations in RM and, more importantly, how easy it is to alter them!
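For readers who want to try the idea outside RM, here is a rough Python sketch of one possible reading of it. It is not the process itself. The assumptions are mine: equal-width binning on a1, candidate bin counts from 3 to 6, and a score equal to the unweighted mean, over non-empty bins, of the gap between the bin average and 1. All function names are hypothetical.

```python
def equal_width_edges(low, high, k):
    """Return k + 1 edges splitting [low, high] into k equal-width bins."""
    step = (high - low) / k
    return [low + i * step for i in range(k)] + [high]


def bin_means(a1, a20, edges):
    """Average of a20 per bin; bins are (edge_i, edge_i+1], first closed on the left.

    Assumes every a1 value lies within [edges[0], edges[-1]].
    Empty bins are reported as None.
    """
    k = len(edges) - 1
    sums, counts = [0] * k, [0] * k
    for value, label in zip(a1, a20):
        for i in range(k):  # a linear scan is fine for a handful of bins
            if value <= edges[i + 1]:
                sums[i] += label
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]


def gap_from_target(means, target=1.0):
    """Mean absolute gap between the non-empty bin averages and the target."""
    used = [m for m in means if m is not None]
    return sum(abs(target - m) for m in used) / len(used)


def best_bin_count(a1, a20, candidates=range(3, 7)):
    """Try each bin count and return (smallest gap, bin count)."""
    low, high = min(a1), max(a1)
    scored = []
    for k in candidates:
        means = bin_means(a1, a20, equal_width_edges(low, high, k))
        scored.append((gap_from_target(means), k))
    return min(scored)
```

The gap is averaged over bins rather than over examples on purpose: weighted by bin size, the total gap from 1 is just the number of negatives in the data, so it would be the same for every binning and there would be nothing to optimise.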
Have fun..
;D
Your approach gave me good ideas, and I thank you for it. I forgot to mention that it won't work on this problem as is, since 1 in all cases is not the optimal (best they could be) solution: because this is a binominal problem, a maximum of one class in one lot means it should be at a minimum in the adjacent lot. The best solution is when the averages are as close to 1 as possible in odd lots and as close to 0 as possible in even lots (or vice versa). Anyway, I used your process in a similar fashion to obtain a good solution. I created 400 bins and, by logging, could see the average for each of them. Merging these, I got 40 bins (plotting the averages made this a simple task, since all bins are the same size). Afterwards I repeated the process and manually made 5 bins. Thanks for your time.
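In case it helps anyone repeating this outside RM, the coarsening steps can be sketched in plain Python roughly as follows. This is an illustration under my own assumptions, not the process I actually ran: bins are equal width, merging simply pools every 10 neighbouring bins, and the function names are made up. The final choice of 5 bins was manual and is not automated here.

```python
def fine_bin_totals(a1, a20, low, high, n_bins=400):
    """Sum of a20 and example count per equal-width bin over [low, high].

    Assumes all a1 values lie within [low, high]. Bins are [lo, hi) here,
    with the top edge folded into the last bin.
    """
    width = (high - low) / n_bins
    sums, counts = [0] * n_bins, [0] * n_bins
    for value, label in zip(a1, a20):
        i = min(int((value - low) / width), n_bins - 1)
        sums[i] += label
        counts[i] += 1
    return sums, counts


def merge_adjacent(sums, counts, factor=10):
    """Pool every `factor` neighbouring bins, e.g. 400 bins -> 40 bins."""
    merged_sums = [sum(sums[i:i + factor]) for i in range(0, len(sums), factor)]
    merged_counts = [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]
    return merged_sums, merged_counts


def alternating_gap(sums, counts):
    """Mean gap from an alternating 1, 0, 1, ... (or 0, 1, 0, ...) target.

    Empty bins are skipped, which shifts the alternation past them;
    that is a simplification of this sketch.
    """
    means = [s / c for s, c in zip(sums, counts) if c]

    def gap(first_target):
        total = 0.0
        for i, m in enumerate(means):
            target = first_target if i % 2 == 0 else 1 - first_target
            total += abs(m - target)
        return total / len(means)

    return min(gap(1), gap(0))
```

With the 400-bin totals in hand, `merge_adjacent(sums, counts, 10)` gives the 40-bin view, and `alternating_gap` scores any candidate set of lots against the odd/even criterion described above.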
Cheerz,
Marin