Distance measure with missing values
I would recommend correcting the distance measures for missing values, or at least adding a note on how the distance is calculated (a warning etc.).
Currently the distance is calculated as:
public double calculateDistance(double[] value1, double[] value2) {
    double sum = 0.0;
    int counter = 0;
    for (int i = 0; i < value1.length; i++) {
        if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
            double diff = value1[i] - value2[i];
            sum += diff * diff;
            counter++;
        }
    }
    if (counter > 0) {
        return Math.sqrt(sum);
    } else {
        return Double.NaN;
    }
}
So missing attributes are ignored, which means that the distance to an instance with missing values is smaller than the distance to a comparable instance without missing values. In other words, for kNN and other distance-based methods, instances with missing values are preferred (they appear closer) over the others. This leads to incorrect classification results.
The state-of-the-art practice is implemented as:
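To make the bias concrete, here is a minimal sketch that reproduces the logic of the method quoted above on two toy instances. The class name MissingValueBiasDemo and the method name skipMissingDistance are made up for this illustration and are not part of the actual code base.

```java
// Hypothetical demo class; it only mirrors the calculateDistance logic quoted above.
final class MissingValueBiasDemo {

    // Euclidean distance that silently skips every attribute where either value is NaN.
    static double skipMissingDistance(double[] a, double[] b) {
        double sum = 0.0;
        int counter = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                double diff = a[i] - b[i];
                sum += diff * diff;
                counter++;
            }
        }
        // No comparable attribute at all: the distance is undefined.
        return counter > 0 ? Math.sqrt(sum) : Double.NaN;
    }
}
```

With the query {0, 0}, the complete instance {3, 4} gets distance 5, while {3, NaN} gets distance 3, so the incomplete instance is ranked as the nearer neighbour only because its second attribute is unknown.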
    if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
        double diff = value1[i] - value2[i];
        sum += diff * diff;
        counter++;
    } else {
        double diff = max(i) - min(i);
        sum += diff * diff;
        counter++;
    }
where max(i) and min(i) are the maximum and minimum values of attribute i in the training set,
or simply diff = 1 if the attribute is normalized to the range [0, 1].
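Put together, the proposal could look roughly like the sketch below. It is only an illustration under stated assumptions: the class RangePenaltyDistance, the helper attributeRanges and the way the per-attribute minima and maxima are passed in as arrays are invented here, and the real code would have to obtain the ranges from wherever the training data is available.

```java
// Hypothetical sketch of the proposed behaviour; all names are illustrative only.
final class RangePenaltyDistance {

    // Computes per-attribute minimum and maximum over the training set, ignoring NaN.
    // Returns {min, max}. An attribute with no observed value keeps an infinite range,
    // which makes the distance infinite; real code should handle that case separately.
    static double[][] attributeRanges(double[][] train, int numAttributes) {
        double[] min = new double[numAttributes];
        double[] max = new double[numAttributes];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : train) {
            for (int i = 0; i < numAttributes; i++) {
                if (!Double.isNaN(row[i])) {
                    min[i] = Math.min(min[i], row[i]);
                    max[i] = Math.max(max[i], row[i]);
                }
            }
        }
        return new double[][] {min, max};
    }

    // Euclidean distance where a missing value contributes the full attribute range
    // (the largest possible difference) instead of being skipped.
    static double distance(double[] a, double[] b, double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff;
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                diff = a[i] - b[i];
            } else {
                diff = max[i] - min[i];
            }
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

With training rows {0, 0} and {3, 4}, the query {0, 0} now has distance 5 to {3, NaN}, the same as to the complete instance {3, 4}, so an incomplete instance can no longer be closer than the worst case for its missing attribute. For attributes normalized to [0, 1], passing min = 0 and max = 1 gives the diff = 1 variant.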