Strange Results With Local Outlier Factor
I am getting strange results with the LOF operator: most of the "outlier" values are around 0.15 instead of around 1.0.
However, for most points LOF should be around 1.0, for the following reasons:
1) The LOF paper proves that LOF is around 1.0 for most points inside clusters.
2) It makes sense intuitively: LOF essentially compares a point's local density with the density of its neighbours, so you'd expect a value around 1 for most points in clusters anyway!
3) My own implementation of a simpler variant of LOF (just average of k-dist) does give LOF of around 1 for most points.
I tried this both on my own data and on data generated with RapidMiner, but the LOF values from RapidMiner are around 0.15 in both cases.
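To make point 3 concrete, roughly this is the kind of simplified score I mean: the k-distance of a point divided by the mean k-distance of its k nearest neighbours. The sketch below is only an illustration of the idea (not my exact implementation, and not the full LOF from the paper), but for points deep inside a cluster this ratio also comes out close to 1, while a far-away point gets a much larger value.

import java.util.Arrays;
import java.util.Random;

public class SimpleKDistScore {

    // Euclidean distance between two points.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Distance from point i to its k-th nearest neighbour.
    static double kDist(double[][] data, int i, int k) {
        double[] d = new double[data.length];
        for (int j = 0; j < data.length; j++) {
            d[j] = dist(data[i], data[j]);
        }
        Arrays.sort(d);
        return d[k]; // d[0] is the distance of the point to itself (0)
    }

    // Indices of the k nearest neighbours of point i (excluding i itself).
    static int[] neighbours(double[][] data, int i, int k) {
        Integer[] idx = new Integer[data.length];
        for (int j = 0; j < data.length; j++) idx[j] = j;
        Arrays.sort(idx, (a, b) -> Double.compare(dist(data[i], data[a]), dist(data[i], data[b])));
        int[] result = new int[k];
        for (int j = 0; j < k; j++) result[j] = idx[j + 1];
        return result;
    }

    // Simplified score: k-distance of point i relative to the mean k-distance
    // of its neighbours. Around 1 inside clusters, clearly above 1 for outliers.
    static double score(double[][] data, int i, int k) {
        double mean = 0;
        for (int n : neighbours(data, i, k)) {
            mean += kDist(data, n, k);
        }
        mean /= k;
        return kDist(data, i, k) / mean;
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        double[][] data = new double[101][2];
        // 100 points from one Gaussian cluster ...
        for (int i = 0; i < 100; i++) {
            data[i][0] = rnd.nextGaussian();
            data[i][1] = rnd.nextGaussian();
        }
        // ... plus one point far away from the cluster.
        data[100][0] = 10;
        data[100][1] = 10;
        System.out.printf("cluster point: %.3f%n", score(data, 0, 10));   // close to 1
        System.out.printf("outlier:       %.3f%n", score(data, 100, 10)); // much larger
    }
}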
Here is code to recreate the synthetic test:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="476" width="681">
<operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="165">
<parameter key="target_function" value="gaussian mixture clusters"/>
<parameter key="number_examples" value="1000"/>
<parameter key="number_of_attributes" value="2"/>
</operator>
<operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data (2)" width="90" x="45" y="255">
<parameter key="number_examples" value="20"/>
<parameter key="number_of_attributes" value="2"/>
</operator>
<operator activated="true" class="discretize_by_bins" expanded="true" height="94" name="Discretize" width="90" x="179" y="255">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="attributes" value="label"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="255">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="447" y="165"/>
<operator activated="true" class="detect_outlier_lof" expanded="true" height="76" name="Detect Outlier (LOF)" width="90" x="514" y="30"/>
<connect from_op="Generate Data" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data (2)" from_port="output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Detect Outlier (LOF)" to_port="example set input"/>
<connect from_op="Detect Outlier (LOF)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
Unfortunately I'm not familiar with this algorithm, and a short glance at the source code didn't make me any smarter. You seem to be at least halfway an expert on this; could you take a look? If you find the problem, simply file a bug report.
If you don't have the time, you could file a bug report anyway, but I doubt we will manage to find the problem any time soon.
Greetings,
Sebastian
In any case, I'm hardly an expert!
Far from it: I am a beginner, and I could be wrong about my expectation (i.e. the bug could just as well be mine and not in RapidMiner). I posted the question hoping that someone who is an expert could give their opinion.
I see the problem. Hm. Do you have a link to the original paper? This operator was contributed by the community, so nobody here is familiar with the implementation. We will have to dive deeper into this matter, but that will take some time.
Greetings,
Sebastian
For comparison, here are LOF values for the same examples from RapidMiner and from R's dprep package. RapidMiner's:
0.10705703 0.05623235 0.13564975 0.09714966 0.10411321 0.05615648 0.13563153 0.16206154 0.05677983 0.17250688 0.09351030 0.17039931 14.70213398 0.03649292 0.08855556 0.62346659 0.05777326 0.41748211 0.35321167 0.62346724 1.02022896 0.37671896 0.15250039 0.62346824 0.17060555 0.15409052 0.17671467 0.35942272 0.08493053 0.54318228 0.09604710 0.12895404 0.05779714 3.51261825 0.17676736 0.40118616 0.62368668 0.05617499 0.09426575 0.40116545
R (dprep):
1.0593654 0.9767560 1.1121496 1.0199023 1.0593438 0.9767494 1.1121422 1.1121542 0.9556079 0.9757669 1.2527428 1.1488689 5.9182867 0.9885184 1.2827731 1.3887066 0.9582217 1.1607783 1.4223003 1.3887070 1.8413799 1.2956872 1.0276760 1.3887041 1.1488966 1.0235487 1.0242877 1.1174309 1.2828083 1.7161344 1.0182297 1.0177936 0.9581262 3.0147269 1.0243710 1.5825578 1.3887215 0.9767419 0.9739126 1.5825385
Thanks for the comparison values. I have added this as a bug report to the tracker.
Greetings,
Sebastian
I hope that helps.
Thank you very much.
Any update on this operator? If not, is there any way to extract the outlier measure from KNN outlier detection?
I.e., can we somehow get a ranked list of outliers instead of just the top n?
Thanks,
-Gagi
Sorry, but there has been no progress at all.
It would take me at least a day to get deep enough into the matter to fix it, and compared to that effort there are many other things that must be done first. I hope you understand that we have to give priority to the issues of our enterprise customers. In the end it comes down to the fact that the community version of this open-source product offers you the possibility to fix it yourself and send in a patch, or otherwise to become an enterprise customer.
Sorry that I have to repeat this so often, but the more enterprise customers we have, the more developers would be available to fix bugs and fulfill feature requests...
Anyway, you can use one of the COF- or LOF-based outlier detection operators to get an 'outlierness' attribute. You can of course sort the examples by this attribute and extract only the first n.
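Inside RapidMiner that would be something like a Sort operator on the outlier attribute followed by a Filter Example Range operator. If it helps to see the idea outside RapidMiner, here is a minimal Java sketch of the same ranking step, assuming you already have an outlierness score per example (the class and variable names are made up for illustration):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopOutliers {

    // Return the ids of the n examples with the highest outlierness score.
    // 'scores' maps an example id to its outlierness value, i.e. a stand-in
    // for the attribute produced by the LOF/COF operator.
    static List<String> topN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}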
Greetings,
Sebastian
-Gagi
I usually trust whatever yields good results for my task. I think it will be difficult to get some sort of gold standard for different implementations of the same algorithm. Just think of the three different SVMs in RapidMiner: they all deliver different results on the same task, although they follow the same algorithm. The problem is that there are so many numerical issues that are addressed differently by the various implementations... But of course you can try to compare implementations on a large set of data sets with different properties to get an impression of which works better.
Greetings,
Sebastian
You just have to change this line:
for (int j = 1; j <= k; j++) {
to this:
for (int j = 1; j <= kMax; j++) {
I have changed the source code, trusting your judgment. I hope you are right about that.
Greetings,
Sebastian