"Correlation, weird behavior"
[begin edit] Dear All, [end edit]
The following data has correlation: 0.999
#  sum    prediction(sum)      a1     a2   a3
1  6.0    11.06672979160903    1.0    2.0  3.0
2  9.0    11.066728936515114   2.0    3.0  4.0
3  15.0   11.066735677516975   9.0    2.0  4.0
4  11.0   11.066728936098524   4.0    5.0  2.0
5  16.0   11.06672900369881    6.0    1.0  9.0
6  5.0    11.066728942691093   0.0    3.0  2.0
7  4.0    11.066728979026438   0.0    3.0  1.0
8  9.0    11.066728936099063   3.0    5.0  1.0
9  359.0  349.5374686083969    344.0  8.0  7.0
Is this how correlation is supposed to work?
I never knew that correlation was so strongly affected by outliers.
Best regards,
Wessel
Answers
What about saying hello before bursting out with some statement?
Regarding your question: yes, it is. Correlation is built on the covariance, which is the average of the products of each value's deviation from its attribute's mean, so a single value far from the mean contributes a very large product.
Or do you suggest that we have an error in the calculation routine? Then please specify the process you used and give some comparable results from other software.
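To make the definition above concrete, here is a minimal sketch in plain Python that computes the correlation for the nine rows posted in the question, once with all rows and once with row 9 left out. The function name `pearson` and the variable names are only illustrative; this is not RapidMiner's or WEKA's actual routine.

```python
import math

# Rows 1-9 from the table in the question: actual sum and predicted sum.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (n times the covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (n times each variance).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The factors of n cancel, so they are left out.
    return sxy / math.sqrt(sxx * syy)


r_all = pearson(actual, predicted)                # all nine rows
r_without = pearson(actual[:-1], predicted[:-1])  # row 9 dropped

print(f"with row 9:    {r_all:.3f}")
print(f"without row 9: {r_without:.3f}")
```

With all nine rows this should reproduce the 0.999 from the question. Without row 9 the value falls to around 0.5, because the remaining eight predictions are all essentially 11.07 and only the one extreme row pulls the points onto a line.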
Greetings,
Sebastian
Just to be sure, I ran the same experiment in both WEKA and RapidMiner.
Both give the same results.
So no, the calculation is fine.
(The chances of RapidMiner being wrong are small :P,
the chances of both WEKA and RapidMiner being wrong are really small.)
It seems undesirable that a performance measure depends so heavily on trivial things, such as outliers in the data.
So when using correlation as a performance measure, it is very important to keep this behavior in mind.
I'm thinking about a modified correlation measure that is more robust with respect to outliers.
Simply rescaling won't do the job, because correlation is independent of linear scaling.
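As a rough illustration of both points, the sketch below first checks that linearly rescaling one variable leaves the correlation unchanged, and then computes a rank-based correlation on the same nine rows. Spearman's rank correlation is simply one well-known rank-based measure picked here for illustration; it is not something proposed in this thread, and the helper names `average_ranks` and `spearman` are made up for the example.

```python
import math

# Rows 1-9 from the table in the question: actual sum and predicted sum.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]


def pearson(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


r_plain = pearson(actual, predicted)
# Rescale the actual values linearly (factor and offset chosen arbitrarily).
r_scaled = pearson([10.0 * x + 3.0 for x in actual], predicted)
r_rank = spearman(actual, predicted)

print(f"Pearson:          {r_plain:.3f}")
print(f"Pearson, rescaled: {r_scaled:.3f}")
print(f"Spearman:         {r_rank:.3f}")
```

The two Pearson values agree up to floating-point error, which is why rescaling cannot help. The rank-based value comes out far below 0.999, because row 9 counts only as "the largest value" and not by how far away it lies. Whether a rank-based measure is the right performance criterion for a regression model is a separate question.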
Do you know any literature about that? It seems very likely to me that someone else has already stumbled over this issue.
And you are right. One has to keep that in mind, but if you look at a plot of your values, any human would assume that there is a linear dependency.
Greetings,
Sebastian