What kind of algorithm should I use?

Antonios1 · November 2020

Hi,

I have a dataset in which I would like to detect a cluster, like the red dots in the attached simplified picture. I tried cluster analysis and outlier analysis using several operators (LOF, k-means, x-means, decision tree, etc.) and even Auto Model, but I can't tell whether I am on the right path and, above all, I don't know whether the operators I chose are the right ones. Can anybody help?

BalazsBarany · November 2020

Hi @Antonios1,

this looks like the textbook example of a distance-based outlier detection. Check out the Anomaly Detection extension on the Marketplace, and Detect Outlier (LOF) included in Studio. Try to apply the appropriate algorithm on your data, play with parameters, and visualize the results.

If this fails: try the Cross Distances operator and analyze the numerical distances between the elements. Try to find thresholds like "X neighbors inside a distance of Y" that describe the clusters the way you need them.

Regards,
Balázs
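The "X neighbors inside a distance of Y" idea can be sketched outside RapidMiner as follows. This is a minimal illustration in plain Python for a single numeric column, not the Cross Distances operator itself; the function names `neighbors_within` and `dense_points`, the toy data, and the chosen thresholds are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def neighbors_within(values, radius):
    """For each value, count how many OTHER values lie within `radius` of it."""
    ordered = sorted(values)
    counts = []
    for v in values:
        lo = bisect_left(ordered, v - radius)
        hi = bisect_right(ordered, v + radius)
        counts.append(hi - lo - 1)  # subtract 1 to exclude the point itself
    return counts


def dense_points(values, radius, min_neighbors):
    """Keep the values that have at least `min_neighbors` neighbors within `radius`."""
    counts = neighbors_within(values, radius)
    return [v for v, c in zip(values, counts) if c >= min_neighbors]


# Toy data (hypothetical): scattered noise plus a repeated value.
data = [120, 2900, 48000, 2900, 15730, 2900, 2900, 33100, 2900, 7]
print(dense_points(data, radius=1, min_neighbors=3))
# prints [2900, 2900, 2900, 2900, 2900]
```

Both `radius` (Y) and `min_neighbors` (X) have to be tuned to the data; the values here only suit the toy example.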

Antonios1 · November 2020

Thank you for helping, Balázs,

I tried Studio's Detect Outlier (LOF) and Detect Outlier (Distances), and the Marketplace's Local Outlier Probability (LOP), and I played with the parameters. Analyzing the results, I don't get anything significant, at least to me, except for the LOF operator, where the cluster of numbers I want to detect gets an outlier score of 0.

In case it helps, my dataset has 1 column with 3047 rows: 2781 rows contain numbers ranging randomly from 0 to 50000, and 266 rows contain a fixed number, which in my case is 2900. The 2900s are the ones I'd like to detect.

BalazsBarany · November 2020

Hi,

so can you use the LOF results? For example with Generate Attributes to create the Cluster attribute (outlier == 0)?

If the example set has just one attribute, you could aggregate by that attribute value and count the results. Then you could sort by the count in descending order and keep the top N classes, or remove classes having fewer than N examples.

Regards,
Balázs
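For a single-attribute example set, the aggregate-and-count approach above might look like this in plain Python. It is a sketch under assumptions, not a RapidMiner process: `frequent_values`, `min_count`, and the toy data are hypothetical names and values chosen for illustration.

```python
from collections import Counter


def frequent_values(values, min_count):
    """Count each distinct value and keep those seen at least `min_count` times,
    sorted by count in descending order."""
    counts = Counter(values)
    return [(v, c) for v, c in counts.most_common() if c >= min_count]


# Toy data (hypothetical): scattered noise plus a repeated value.
data = [120, 2900, 48000, 2900, 15730, 2900, 2900, 33100, 2900, 7]
print(frequent_values(data, min_count=3))
# prints [(2900, 5)]
```

This only works when the cluster consists of exactly repeated values, as in the dataset described above; for values that are merely close together, a distance-based approach is needed.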

MartinLiebig · November 2020

Hi @Antonios1 ,

I think the data set just contains no outliers. It looks like an example of why LOF shows you no outliers, while a normal k-NN global anomaly score says every blue one is an outlier.

Best,

Martin

Antonios1 · November 2020

I think I've done it, but I don't know if it's correct or not.

Anyway, I'm trying to understand: is my outlier result of 0 correct? I read that a higher LOF value indicates an outlier. Maybe 0 means the opposite, so in my case a lot of homogeneous values (2900)?

Image: https://us.v-cdn.net/6030995/uploads/editor/ru/ufy6vb0op56z.jpg

Antonios1 · November 2020

Thanks for helping, @mschmitz.

What kind of algorithm might I use to detect the red ones?

MartinLiebig · November 2020

Hi @Antonios1 ,

You think the red ones are outliers? Common definitions of outliers would call either nothing an outlier or the blue ones.

Anyway, I've reproduced your data set and used a k-NN global anomaly score on it. The outlier score separates the Gaussian cluster and the random noise very well:

Attached is the process

Best,

Martin

<?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="9.8.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="target_function" value="random"/>
        <parameter key="number_examples" value="50"/>
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.8.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
        <list key="function_descriptions">
          <parameter key="label" value="&quot;random&quot;"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="generate_data" compatibility="9.8.000" expanded="true" height="68" name="Generate Data (2)" width="90" x="45" y="136">
        <parameter key="target_function" value="single gaussian cluster"/>
        <parameter key="number_examples" value="1000"/>
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="0.5"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.8.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="179" y="136">
        <list key="function_descriptions">
          <parameter key="label" value="&quot;gaussian&quot;"/>
          <parameter key="att1" value="att1-5"/>
          <parameter key="att2" value="att2+3"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="append" compatibility="9.8.000" expanded="true" height="103" name="Append" width="90" x="380" y="34">
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="merge_type" value="all"/>
      </operator>
      <operator activated="true" class="anomalydetection:k-NN Global Anomaly Score" compatibility="2.4.001" expanded="true" height="103" name="k-NN Global Anomaly Score" width="90" x="581" y="34">
        <parameter key="k" value="10"/>
        <parameter key="use k-th neighbor distance only (no average)" value="false"/>
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="parallelize evaluation process" value="false"/>
        <parameter key="number of threads" value="8"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Generate Data (2)" from_port="output" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Append" from_port="merged set" to_op="k-NN Global Anomaly Score" to_port="example set"/>
      <connect from_op="k-NN Global Anomaly Score" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
</operator>
</process>

Antonios1 · November 2020

Thank you @mschmitz. Your suggestion spotted exactly my homogeneous cluster of numbers hidden among the noise: 266 out of 266. I also figured out that I can import the process :-) and I'll study it to understand more. So in the end, if I am right, the lower the outlier value, the higher the probability that a point belongs to the kind of cluster I am looking for, correct? If my assumption is correct, one of the operators suggested by @BalazsBarany (LOF) works correctly too, by identifying my hidden cluster with an outlier value of 0. Thank you @BalazsBarany, thank you @mschmitz.

MartinLiebig · November 2020

Hi @Antonios1 ,

Exactly. Keep in mind that outliers usually have a high score. In your case, the points you are searching for are the ones that common definitions would call 'normal points'.

Best,

Martin
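To make the "low score means dense cluster" reading concrete, here is a rough sketch of a k-NN global anomaly score for one numeric column in plain Python. It assumes the score is the average distance to the k nearest neighbors (the process above has the "k-th neighbor distance only" option switched off); it is not the Anomaly Detection extension's implementation, and `knn_global_scores`, the toy data, and the threshold are hypothetical.

```python
def knn_global_scores(values, k):
    """Average absolute distance from each value to its k nearest other values.
    Brute force, so only suitable for small data; requires len(values) > k."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy data (hypothetical): scattered noise plus a repeated value.
data = [120, 2900, 48000, 2900, 15730, 2900, 2900, 33100, 2900, 7]
scores = knn_global_scores(data, k=3)

# Usual outlier detection keeps the HIGH scores; here we keep the LOW ones,
# because the points of interest sit in the densest region.
threshold = 0.0  # assumed cut-off for this toy example
cluster = [v for v, s in zip(data, scores) if s <= threshold]
print(cluster)
# prints [2900, 2900, 2900, 2900, 2900]
```

With real data the cluster's scores will rarely be exactly 0, so the threshold would be read off a histogram or sorted list of the scores.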
