how to remove outliers by ID or in general?
hi,
I tried out 3 different outlier detection algorithms on my dataset...
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.2.001" expanded="true" height="68" name="Retrieve Master3Klassen_nominal" width="90" x="45" y="30">
<parameter key="repository_entry" value="../../../data/Master3Klassen_nominal"/>
</operator>
<operator activated="true" class="generate_id" compatibility="7.2.001" expanded="true" height="82" name="Generate ID" width="90" x="112" y="165"/>
<operator activated="true" class="normalize" compatibility="7.1.001" expanded="true" height="103" name="Normalize" width="90" x="246" y="165">
<parameter key="method" value="range transformation"/>
<parameter key="min" value="-1.0"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.001" expanded="true" height="124" name="Multiply" width="90" x="346" y="142"/>
<operator activated="true" class="anomalydetection:Connectivity-Based Outlier Factor (COF)" compatibility="2.3.002" expanded="true" height="103" name="Connectivity-Based Outlier Factor (COF)" width="90" x="581" y="442">
<parameter key="k" value="5"/>
<parameter key="parallelize evaluation process" value="true"/>
</operator>
<operator activated="true" class="anomalydetection:Local Outlier Factor (LOF)" compatibility="2.3.002" expanded="true" height="103" name="Local Outlier Factor (LOF)" width="90" x="581" y="187">
<parameter key="k_min (MinPtsLB)" value="3"/>
<parameter key="k_max (MinPtsUB)" value="30"/>
<parameter key="numerical_measure" value="CamberraDistance"/>
<parameter key="parallelize evaluation process" value="true"/>
</operator>
<operator activated="true" class="anomalydetection:k-NN Global Anomaly Score" compatibility="2.3.002" expanded="true" height="103" name="k-NN Global Anomaly Score (2)" width="90" x="514" y="75">
<parameter key="k" value="5"/>
<parameter key="use k-th neighbor distance only (no average)" value="true"/>
<parameter key="numerical_measure" value="CamberraDistance"/>
<parameter key="parallelize evaluation process" value="true"/>
</operator>
<connect from_op="Retrieve Master3Klassen_nominal" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="k-NN Global Anomaly Score (2)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Local Outlier Factor (LOF)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="Connectivity-Based Outlier Factor (COF)" to_port="example set"/>
<connect from_op="Connectivity-Based Outlier Factor (COF)" from_port="example set" to_port="result 6"/>
<connect from_op="Connectivity-Based Outlier Factor (COF)" from_port="model" to_port="result 5"/>
<connect from_op="Local Outlier Factor (LOF)" from_port="example set" to_port="result 3"/>
<connect from_op="Local Outlier Factor (LOF)" from_port="model" to_port="result 4"/>
<connect from_op="k-NN Global Anomaly Score (2)" from_port="example set" to_port="result 1"/>
<connect from_op="k-NN Global Anomaly Score (2)" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
</process>
</operator>
</process>
Now I want to sort the outliers so that the most significant ones are on top (e.g. the top 100) for each of the 3 detection algorithms, and select those IDs that appear in the top 100 of all three.
Then I want to remove those examples from the dataset, identified by their IDs. Is this somehow possible? I don't know how to do this...
Answers
You can use the Sort operator: select the outlier score as the attribute to sort by and set the sorting direction to descending.
After that, use "Filter Example Range" with 1 as the first example and 100 as the last example; that will give you the top 100 outliers.
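For readers who want to check the logic outside RapidMiner, here is a minimal Python sketch of the same two steps (sort descending, keep the first k rows). The field names `id` and `outlier` and the helper `top_k_ids` are made up for illustration and are not part of the process above.

```python
def top_k_ids(rows, score_key, k):
    """Return the IDs of the k rows with the highest value in score_key."""
    # Sort descending by the score, analogous to the Sort operator.
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    # Keep rows 1..k, analogous to Filter Example Range.
    return [r["id"] for r in ranked[:k]]


# Hypothetical toy data; real scores would come from one of the detectors.
rows = [
    {"id": 1, "outlier": 0.2},
    {"id": 2, "outlier": 3.1},
    {"id": 3, "outlier": 1.7},
    {"id": 4, "outlier": 0.9},
]
top2 = top_k_ids(rows, "outlier", 2)
```

With k set to 100 this corresponds to the "first example 1, last example 100" setting described above.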
OK, thanks. Is there a way to do this in a combined fashion for more than one outlier algorithm? E.g. I have 4 columns from different outlier algorithms; I sort each of them descending and pick the IDs with the top 100 highest outlier scores for each of the 4 algorithms. Then I want to select those IDs that are present in all 4 lists,
like SVMLIB_Outlier_ID in COF_ID in LOF_ID... etc. What would the formula for that look like, or how should I build the process?
Hi,
have you tried Generate Aggregation to get the max() or the avg() of the 4 scores and filter on it?
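As a rough illustration of that idea (not the RapidMiner operator itself), the following Python sketch combines several score columns per row with max or average and then filters on the combined value. The column names `knn`, `lof`, `cof`, the threshold `2.0`, and the helper `combined_scores` are assumptions made for the example. Note that raw scores from different detectors are usually on different scales, so they may need rescaling before a max or average is meaningful.

```python
from statistics import mean


def combined_scores(rows, score_keys, how="max"):
    """Map each ID to the max (or average) of its scores across score_keys."""
    agg = max if how == "max" else mean
    return {r["id"]: agg([r[k] for k in score_keys]) for r in rows}


# Hypothetical toy data with three score columns.
rows = [
    {"id": 1, "knn": 0.1, "lof": 1.0, "cof": 1.1},
    {"id": 2, "knn": 0.9, "lof": 2.5, "cof": 3.0},
    {"id": 3, "knn": 0.2, "lof": 1.2, "cof": 0.9},
]
keys = ["knn", "lof", "cof"]

by_max = combined_scores(rows, keys, how="max")
# Keep only IDs whose combined score exceeds an (assumed) threshold.
flagged = [i for i, s in by_max.items() if s > 2.0]
```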
~Martin
Dortmund, Germany
No, I didn't try avg or max yet, but I want a range, let's say 2 sigma of deviation for the outliers, which means 10% of my original dataset. Therefore I have to sort by highest outlier rank first and cut the first 10% of the entries...
But I now want to do that only for those IDs that are present in the top 10% of all 4 outlier algorithms' scores. Is it possible to check IDs for presence in the top 10% of a column?
Hi Fred,
my idea would be:
Loop Attributes over the 4 outlier scores, take the top k of each with Filter Example Range, and use Join afterwards to find those which were present in all of them.
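The logic behind this suggestion can be sketched in plain Python: take the top k IDs per score column, intersect the ID sets (which plays the role of an inner join on ID), and drop those IDs from the data. This is only a sketch of the idea, not a RapidMiner process; the column names and the helpers `top_ids` and `remove_consensus_outliers` are hypothetical.

```python
def top_ids(rows, score_key, k):
    """Return the set of IDs of the k rows with the highest score_key."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    return {r["id"] for r in ranked[:k]}


def remove_consensus_outliers(rows, score_keys, k):
    """Drop rows whose ID is in the top k of every score column.

    Returns (cleaned_rows, removed_ids). Requires at least one score key.
    """
    # Intersecting the per-score ID sets keeps only IDs present in all of them.
    common = set.intersection(*(top_ids(rows, key, k) for key in score_keys))
    cleaned = [r for r in rows if r["id"] not in common]
    return cleaned, common


# Hypothetical toy data with three score columns.
rows = [
    {"id": 1, "knn": 0.1, "lof": 1.0, "cof": 1.0},
    {"id": 2, "knn": 0.9, "lof": 3.0, "cof": 2.5},
    {"id": 3, "knn": 0.8, "lof": 1.1, "cof": 2.8},
    {"id": 4, "knn": 0.2, "lof": 2.9, "cof": 1.2},
    {"id": 5, "knn": 0.3, "lof": 1.2, "cof": 1.1},
]
cleaned, removed = remove_consensus_outliers(rows, ["knn", "lof", "cof"], k=2)
```

For a "top 10%" cut instead of a fixed count, k could be derived from the data size, for example `k = max(1, len(rows) // 10)`.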
~Martin
Dortmund, Germany
OK, thanks, I will try that.