Strange Results With Local Outlier Factor
I am getting strange results with the LOF operator: most of the "outlier" values are around 0.15 instead of around 1.0.
However, for most points LOF should be around 1.0, for the following reasons:
1) The LOF paper proves that LOF is around 1.0 for most points inside clusters.
2) It makes sense intuitively: LOF essentially compares a point's local density with the density of its neighbours, so you'd expect a value around 1 for most points in clusters anyway!
3) My own implementation of a simpler variant of LOF (just average of k-dist) does give LOF of around 1 for most points.
I tried this both on my own data and on data generated with RapidMiner, but the LOF values from RapidMiner are around 0.15 in both cases.
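To make point 3 concrete, roughly this is the kind of simplified score I mean: the k-distance of a point divided by the mean k-distance of its k nearest neighbours. The sketch below is only an illustration of the idea (not my exact implementation, and not the full LOF from the paper), but for points deep inside a cluster this ratio also comes out close to 1, while a far-away point gets a much larger value.

import java.util.Arrays;
import java.util.Random;

public class SimpleKDistScore {

    // Euclidean distance between two points.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Distance from point i to its k-th nearest neighbour.
    static double kDist(double[][] data, int i, int k) {
        double[] d = new double[data.length];
        for (int j = 0; j < data.length; j++) {
            d[j] = dist(data[i], data[j]);
        }
        Arrays.sort(d);
        return d[k]; // d[0] is the distance of the point to itself (0)
    }

    // Indices of the k nearest neighbours of point i (excluding i itself).
    static int[] neighbours(double[][] data, int i, int k) {
        Integer[] idx = new Integer[data.length];
        for (int j = 0; j < data.length; j++) idx[j] = j;
        Arrays.sort(idx, (a, b) -> Double.compare(dist(data[i], data[a]), dist(data[i], data[b])));
        int[] result = new int[k];
        for (int j = 0; j < k; j++) result[j] = idx[j + 1];
        return result;
    }

    // Simplified score: k-distance of point i relative to the mean k-distance
    // of its neighbours. Around 1 inside clusters, clearly above 1 for outliers.
    static double score(double[][] data, int i, int k) {
        double mean = 0;
        for (int n : neighbours(data, i, k)) {
            mean += kDist(data, n, k);
        }
        mean /= k;
        return kDist(data, i, k) / mean;
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        double[][] data = new double[101][2];
        // 100 points from one Gaussian cluster ...
        for (int i = 0; i < 100; i++) {
            data[i][0] = rnd.nextGaussian();
            data[i][1] = rnd.nextGaussian();
        }
        // ... plus one point far away from the cluster.
        data[100][0] = 10;
        data[100][1] = 10;
        System.out.printf("cluster point: %.3f%n", score(data, 0, 10));   // close to 1
        System.out.printf("outlier:       %.3f%n", score(data, 100, 10)); // much larger
    }
}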
Here is code to recreate the synthetic test:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="476" width="681">
<operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="165">
<parameter key="target_function" value="gaussian mixture clusters"/>
<parameter key="number_examples" value="1000"/>
<parameter key="number_of_attributes" value="2"/>
</operator>
<operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data (2)" width="90" x="45" y="255">
<parameter key="number_examples" value="20"/>
<parameter key="number_of_attributes" value="2"/>
</operator>
<operator activated="true" class="discretize_by_bins" expanded="true" height="94" name="Discretize" width="90" x="179" y="255">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="attributes" value="label"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="255">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="447" y="165"/>
<operator activated="true" class="detect_outlier_lof" expanded="true" height="76" name="Detect Outlier (LOF)" width="90" x="514" y="30"/>
<connect from_op="Generate Data" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data (2)" from_port="output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Detect Outlier (LOF)" to_port="example set input"/>
<connect from_op="Detect Outlier (LOF)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
Unfortunately I'm not familiar with this algorithm, and a short glance at the source code didn't make me any smarter. You seem to be at least halfway an expert on this; could you take a look? If you find the problem, simply file a bug report.
If you don't have the time, you could file a bug report anyway, but I doubt we will manage to find the problem any time soon.
Greetings,
Sebastian
In any case, I'm hardly an expert!
Far from it: I am a beginner, and I could be wrong about my expectation (i.e. the bug could just as well be mine and not in RapidMiner). I posted the question hoping that someone who is an expert could give their opinion.
I see the problem. Hm. Do you have a link to the original paper? This operator was contributed by the community, so nobody here is familiar with the implementation. We will have to dive deeper into this matter, but that will take some time.
Greetings,
Sebastian
For comparison, here are LOF values for the same examples from RapidMiner and from R's dprep package. RapidMiner's:
0.10705703 0.05623235 0.13564975 0.09714966 0.10411321 0.05615648 0.13563153 0.16206154 0.05677983 0.17250688 0.09351030 0.17039931 14.70213398 0.03649292 0.08855556 0.62346659 0.05777326 0.41748211 0.35321167 0.62346724 1.02022896 0.37671896 0.15250039 0.62346824 0.17060555 0.15409052 0.17671467 0.35942272 0.08493053 0.54318228 0.09604710 0.12895404 0.05779714 3.51261825 0.17676736 0.40118616 0.62368668 0.05617499 0.09426575 0.40116545
R (dprep):
1.0593654 0.9767560 1.1121496 1.0199023 1.0593438 0.9767494 1.1121422 1.1121542 0.9556079 0.9757669 1.2527428 1.1488689 5.9182867 0.9885184 1.2827731 1.3887066 0.9582217 1.1607783 1.4223003 1.3887070 1.8413799 1.2956872 1.0276760 1.3887041 1.1488966 1.0235487 1.0242877 1.1174309 1.2828083 1.7161344 1.0182297 1.0177936 0.9581262 3.0147269 1.0243710 1.5825578 1.3887215 0.9767419 0.9739126 1.5825385
Thanks for the comparison values. I have added this as a bug report to the tracker.
Greetings,
Sebastian
I hope that helps.
Thank you very much.
Any update on this operator? If not, is there any way to extract the outlier measure from KNN outlier detection?
I.e., can we somehow get a ranked list of outliers instead of just the top n?
Thanks,
-Gagi
Sorry, but there has been no progress at all.
It would take me at least a day to get deep enough into the matter to fix it, and compared to that effort there are many other things that must be done first. I hope you understand that we have to give priority to the issues of our enterprise customers. In the end it comes down to the fact that the community version of this open-source product offers you the possibility to fix it yourself and send in a patch, or otherwise to become an enterprise customer.
Sorry that I have to repeat this so often, but the more enterprise customers we have, the more developers would be available to fix bugs and fulfill feature requests...
Anyway, you can use one of the COF- or LOF-based outlier detection operators to get an 'outlierness' attribute. You can of course sort the examples by this attribute and extract only the first n.
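Inside RapidMiner that would be something like a Sort operator on the outlier attribute followed by a Filter Example Range operator. If it helps to see the idea outside RapidMiner, here is a minimal Java sketch of the same ranking step, assuming you already have an outlierness score per example (the class and variable names are made up for illustration):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopOutliers {

    // Return the ids of the n examples with the highest outlierness score.
    // 'scores' maps an example id to its outlierness value, i.e. a stand-in
    // for the attribute produced by the LOF/COF operator.
    static List<String> topN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}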
Greetings,
Sebastian
-Gagi
I usually trust whatever yields good results for my task. I think it will be difficult to get some sort of gold standard for different implementations of the same algorithm. Just think of the three different SVMs in RapidMiner: they all deliver different results on the same task, although they follow the same algorithm. The problem is that there are so many numerical issues that are addressed differently by the various implementations... But of course you can try to compare implementations on a large set of data sets with different properties to get an impression of which works better.
Greetings,
Sebastian
You just have to change this line:
for (int j = 1; j <= k; j++) {
to this:
for (int j = 1; j <= kMax; j++) {
I have changed the source code, trusting your judgment. I hope you are right about that.
Greetings,
Sebastian