Understanding mixed Euclidean distance calculation for polynominal and binominal attributes
Hi!
I'm aware of some previous posts about how the mixed Euclidean distance is calculated. My understanding is that numeric attributes use the standard Euclidean calculation, whereas nominal attributes contribute a distance of 1 whenever the two values differ.
However, I cannot make sense of the results I am getting for a simple example where I have polynominal and binominal attributes (which I expected to be treated the same way).
The data is as follows:
REQUEST EXAMPLE
1 | 10 | 7 | 5 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
REFERENCE EXAMPLES
1 | 1 | 2 | 8 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
2 | 15 | 4 | 5 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
3 | 15 | 4 | 5 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
The first column is the row id, the second column is a class attribute (ignored in the calculation), the third and fourth columns are polynominal and the rest are binominal.
The output (request id | reference id | distance) is:
1.0 | 1.0 | 0.0 |
1.0 | 2.0 | 1.4142135623730951 |
1.0 | 3.0 | 1.4142135623730951 |
How can the distance between the request example and the first of the reference examples be zero? Most likely, it is a very obvious calculation but I cannot see it...
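As a sanity check on the expected values, here is a minimal Python sketch of the rule as described above: squared difference for numeric values, 0 or 1 for nominal values. It is not RapidMiner's implementation, and the helper name mixed_euclidean is hypothetical. The two polynominal columns are stored as strings so they are compared as categories, and the binominal columns are left out because they are FALSE in every row and so contribute nothing.

```python
import math

def mixed_euclidean(a, b):
    """Mixed Euclidean distance between two equally long attribute tuples.

    Numeric pairs contribute their squared difference; any other pair
    (treated as nominal) contributes 0 if equal and 1 if different.
    """
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

# Third and fourth columns from the data above, as nominal (string) values.
# The id and class columns are excluded from the calculation.
request = ("7", "5")
references = {1: ("2", "8"), 2: ("4", "5"), 3: ("4", "5")}

expected = {rid: mixed_euclidean(request, ref) for rid, ref in references.items()}
for rid, dist in expected.items():
    print(rid, dist)
```

Under these assumptions the sketch gives sqrt(2) for reference 1 (both values differ) and 1.0 for references 2 and 3 (only the third column differs), which is not what the operator reports.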
I would appreciate some help!
My thanks!
Comments
Can you post your XML---it is hard to see how you have your operator configured, and it could be something in the parameter setting (e.g., only looking at nominal and not numerical attributes, etc).
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Sure! Many thanks for the prompt response.
So I can't see your original data here, but I created a simple test process along the lines you explained. And everything seems to be working normally here. Take a look at this process:
This seems to be working as expected. The record that is a duplicate shows a distance of zero. The ones that have differences in the 3 numerical attributes are being calculated in the expected way. And the one record that has the same numerical values but 3 different categorical attributes has a distance value of sqrt(3) as expected.
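For that last case, the arithmetic can be written out as follows, assuming the three numeric attributes match exactly (each contributes 0) and each of the three differing categorical attributes contributes 1:

```latex
d = \sqrt{\underbrace{0 + 0 + 0}_{\text{numeric}} + \underbrace{1 + 1 + 1}_{\text{nominal}}} = \sqrt{3} \approx 1.732
```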
So here are a few ideas for you to troubleshoot in your own setup:
There are no duplicate examples nor numeric attributes. I am attaching the data to this post.
I am sure that both sets of examples have exactly the same number of attributes and that the attributes are named the same, have the same type, and are in the same order. The id is labelled as id, the class is a label, imput and grav are nominal attributes and the rest of the attributes are binominal.
Many thanks!
I am confused---in the dataset you supplied, none of the conditions you specified appear to be true!
These discrepancies would certainly explain why you are not getting the expected results. You should harmonize your datasets in terms of number of attributes and data types, correct discrepancies as needed, and try the operator again.
So sorry, I included three datasets instead of two, hence your confusion. I'm attaching the data again (also as CSV files) and some screenshots of the data and the statistics as presented in RM.
In short, I have one example (small request) that I want to compare against three examples (small reference).
You are right that the examples have the same values for all the binominal attributes. However, the values for imput and grav (the two polynominal attributes) are not always the same.
How can the distance between the request and the reference #1 be zero if they have different values for these attributes?
Yep, I agree, these results are fishy.
@mschmitz might know something more about what is going on with this cross-distance calculation. It doesn't seem to like those initial polynominal attributes (not the binominal ones). Is this a bug in the implementation of cross-distance? Or is there some other weird effect going on here that is not obvious?
@sgenzer you might also remember, there was a related problem with cross-distance earlier in the year. Do you know what ever happened with this thread: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Cross-Distances-operator-Weird-results/m-p/46161
It looks like it was simply abandoned, but combined with this thread, it makes me think there is likely a problem with this operator...
Hi @alourenco, @Telcontar120,
I've run a few tests and it looks like a bug. I will file a ticket.
BR,
Martin
CC: @sgenzer
Dortmund, Germany