AUPRC with imbalanced classes
Hi, it seems I am not getting the expected results when using Performance (AUPRC) with a highly imbalanced dataset.
The relationship between recall and precision of the positive class seems pretty intuitive, but I still get AUPRC = 0.010 no matter what I change:
I am using the imbalanced credit card fraud dataset here.
At the same time, when I artificially balance the data, AUPRC shows the expected 'normal' values:
Process attached:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve creditcard" width="90" x="45" y="34">
<parameter key="repository_entry" value="../data/creditcard"/>
</operator>
<operator activated="true" class="sample" compatibility="8.1.003" expanded="true" height="82" name="equalize classes" width="90" x="179" y="34">
<parameter key="balance_data" value="true"/>
<list key="sample_size_per_class">
<parameter key="1" value="492"/>
<parameter key="0" value="492"/>
</list>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<operator activated="false" class="sample_stratified" compatibility="8.1.003" expanded="true" height="82" name="sample 50k" width="90" x="45" y="340">
<parameter key="sample_size" value="50000"/>
</operator>
<operator activated="false" class="create_threshold" compatibility="8.1.003" expanded="true" height="68" name="Create Threshold" width="90" x="581" y="391">
<parameter key="threshold" value="0.09"/>
<parameter key="first_class" value="0"/>
<parameter key="second_class" value="1"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.1.003" expanded="true" height="103" name="Split Data" width="90" x="246" y="136">
<enumeration key="partitions">
<parameter key="ratio" value="0.8"/>
<parameter key="ratio" value="0.2"/>
</enumeration>
<parameter key="sampling_type" value="stratified sampling"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.1.003" expanded="true" height="145" name="Validation" width="90" x="380" y="34">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true">
<operator activated="false" class="concurrency:parallel_decision_tree" compatibility="8.1.003" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="136">
<parameter key="apply_pruning" value="false"/>
<parameter key="apply_prepruning" value="false"/>
</operator>
<operator activated="true" class="h2o:generalized_linear_model" compatibility="7.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="246" y="34">
<list key="beta_constraints"/>
<list key="expert_parameters"/>
</operator>
<operator activated="false" class="h2o:deep_learning" compatibility="7.6.001" expanded="true" height="82" name="Deep Learning" width="90" x="380" y="136">
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<operator activated="false" class="stacking" compatibility="8.1.003" expanded="true" height="68" name="Stacking" width="90" x="179" y="289">
<process expanded="true">
<operator activated="true" class="h2o:generalized_linear_model" compatibility="7.2.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="179" y="187">
<list key="beta_constraints"/>
<list key="expert_parameters"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.1.003" expanded="true" height="103" name="Decision Tree (2)" width="90" x="112" y="34">
<parameter key="apply_pruning" value="false"/>
<parameter key="apply_prepruning" value="false"/>
</operator>
<operator activated="true" class="h2o:deep_learning" compatibility="7.6.001" expanded="true" height="82" name="Deep Learning (2)" width="90" x="112" y="340">
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="20"/>
<parameter key="hidden_layer_sizes" value="20"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<connect from_port="training set 1" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_port="training set 2" to_op="Generalized Linear Model (2)" to_port="training set"/>
<connect from_port="training set 3" to_op="Deep Learning (2)" to_port="training set"/>
<connect from_op="Generalized Linear Model (2)" from_port="model" to_port="base model 2"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="base model 1"/>
<connect from_op="Deep Learning (2)" from_port="model" to_port="base model 3"/>
<portSpacing port="source_training set 1" spacing="0"/>
<portSpacing port="source_training set 2" spacing="0"/>
<portSpacing port="source_training set 3" spacing="0"/>
<portSpacing port="source_training set 4" spacing="0"/>
<portSpacing port="sink_base model 1" spacing="0"/>
<portSpacing port="sink_base model 2" spacing="0"/>
<portSpacing port="sink_base model 3" spacing="0"/>
<portSpacing port="sink_base model 4" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="h2o:generalized_linear_model" compatibility="7.6.001" expanded="true" height="124" name="Generalized Linear Model (3)" width="90" x="45" y="34">
<list key="beta_constraints"/>
<list key="expert_parameters"/>
</operator>
<connect from_port="stacking examples" to_op="Generalized Linear Model (3)" to_port="training set"/>
<connect from_op="Generalized Linear Model (3)" from_port="model" to_port="stacking model"/>
<portSpacing port="source_stacking examples" spacing="0"/>
<portSpacing port="sink_stacking model" spacing="0"/>
</process>
</operator>
<connect from_port="training set" to_op="Generalized Linear Model" to_port="training set"/>
<connect from_op="Generalized Linear Model" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="apply on train" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="operator_toolbox:performance_auprc" compatibility="1.0.000" expanded="true" height="82" name="perf train" width="90" x="246" y="34">
<parameter key="main_criterion" value="AUPRC"/>
<parameter key="AUC" value="true"/>
<parameter key="AUPRC" value="true"/>
</operator>
<connect from_port="model" to_op="apply on train" to_port="model"/>
<connect from_port="test set" to_op="apply on train" to_port="unlabelled data"/>
<connect from_op="apply on train" from_port="labelled data" to_op="perf train" to_port="labelled data"/>
<connect from_op="perf train" from_port="performance" to_port="performance 1"/>
<connect from_op="perf train" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="apply on test" width="90" x="581" y="136">
<list key="application_parameters"/>
</operator>
<operator activated="false" class="select_recall" compatibility="8.1.003" expanded="true" height="82" name="Select Recall" width="90" x="581" y="289">
<parameter key="min_recall" value="0.8"/>
<parameter key="positive_label" value="1"/>
</operator>
<operator activated="false" class="apply_threshold" compatibility="8.1.003" expanded="true" height="82" name="Apply Threshold" width="90" x="715" y="289"/>
<operator activated="true" class="performance" compatibility="8.1.003" expanded="true" height="82" name="perf test" width="90" x="715" y="136"/>
<operator activated="true" class="operator_toolbox:performance_auprc" compatibility="1.0.000" expanded="true" height="82" name="perf test (2)" width="90" x="849" y="136">
<parameter key="main_criterion" value="AUPRC"/>
<parameter key="accuracy" value="false"/>
<parameter key="AUPRC" value="true"/>
</operator>
<connect from_op="Retrieve creditcard" from_port="output" to_op="equalize classes" to_port="example set input"/>
<connect from_op="equalize classes" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Validation" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="apply on test" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="model" to_op="apply on test" to_port="model"/>
<connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
<connect from_op="apply on test" from_port="labelled data" to_op="perf test" to_port="labelled data"/>
<connect from_op="Select Recall" from_port="example set" to_op="Apply Threshold" to_port="example set"/>
<connect from_op="Select Recall" from_port="threshold" to_op="Apply Threshold" to_port="threshold"/>
<connect from_op="perf test" from_port="performance" to_op="perf test (2)" to_port="performance"/>
<connect from_op="perf test" from_port="example set" to_op="perf test (2)" to_port="labelled data"/>
<connect from_op="perf test (2)" from_port="performance" to_port="result 2"/>
<connect from_op="perf test (2)" from_port="example set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Answers
Hi @kypexin,
isn't that exactly what you would expect? AUPRC is NOT independent of class balance. If you add more and more of one class, then the precision will go down for the other class. Thus the curve becomes flatter and the integral smaller. 0.5 is thus not the lower bound anymore.
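To make the "lower bound" point concrete: a scorer that ranks examples at random has an expected precision equal to the share of positives in the data, whatever the recall. The sketch below is only an illustration of that arithmetic. The function name is made up for this example, and the negative counts are hypothetical (only the 492 positives come from the process above).

```python
def random_baseline_precision(n_pos: int, n_neg: int) -> float:
    """Expected precision of a scorer that ranks examples at random.

    Such a scorer picks positives and negatives in proportion to their
    frequency, so its expected precision is the positive-class prevalence
    at every recall level.
    """
    return n_pos / (n_pos + n_neg)


# Hypothetical class counts, chosen only to show the trend.
for n_neg in (492, 4_920, 49_200, 246_000):
    baseline = random_baseline_precision(492, n_neg)
    print(f"492 positives vs {n_neg} negatives: baseline = {baseline:.4f}")
```

With balanced classes the baseline is 0.5; as negatives are added it moves towards 0, so a small AUPRC on a heavily imbalanced set is not by itself a sign of a bug.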
Best,
Martin
Dortmund, Germany
Hi @mschmitz
Honestly, no, I expected it exactly the other way around.
If we assume that the curve shows precision against the recall of the same positive class (in our case '1'), then varying recall of positive class gives the following:
Low recall, high precision (6/100)
High recall, low precision (93/6)
Around optimum (80/80)
Or do I interpret AUPRC completely wrong? (never used it before in practice)
Vladimir
http://whatthefraud.wtf
PS @mschmitz to give you more intuition, this is the PR curve I am getting on my data (at least what I understand to be that curve)
Vladimir
http://whatthefraud.wtf
Hi @kypexin,
What happens if you switch the class balance? It should go down, right?
Best,
Martin
Dortmund, Germany
Not sure if I got you right, @mschmitz
If I just remap the classes, I will get AUPRC = 0.999 and also this (obviously for the majority class it will be really close to 1):
However, this still does not give me an intuition why in the first case AUPRC = 0.010, while by my logical expectation it should not be.
Vladimir
http://whatthefraud.wtf
Hey @kypexin,
Here is how I see this. If you have a different class balance, you transform the space. Essentially, recall for your positive class stays the same, but the precision for a given recall point changes. This may look like this:
Upper: Normal PR-Curve, Lower with a different Class Ratio
If you have a look at the math, you can see Precision as a function of recall like this:
Adding more negative values will lead to more FP (false positives) and thus lower precision. So naturally AUPRC drops as the class balance changes (if the classifier does not counter this).
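The formula image from the original post is not preserved here. A standard way to write the relationship, which may differ in notation from what was posted, uses P and N for the number of positive and negative examples:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{P}, \qquad
\mathrm{FPR} = \frac{FP}{N}
\;\Longrightarrow\;
\mathrm{Precision} = \frac{\mathrm{Recall}\cdot P}{\mathrm{Recall}\cdot P + \mathrm{FPR}\cdot N}
```

For a fixed recall and false positive rate, increasing N makes the denominator larger, so precision falls.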
Dortmund, Germany
Hey @mschmitz
I totally agree with the point that adding more negative examples lowers precision, so AUPRC naturally drops as the class balance changes. But at the same time, I observe that the influence of class imbalance on AUPRC is really lower than we would expect.
I ran tests on datasets with different imbalance ratios: 1:1, 1:10, 1:100 and 1:500. Below are the PR curves for those cases. As we see, AUPRC drops as the imbalance increases, but not by much.
[Figures: PR curves for class ratios 1:1, 1:10, 1:100 and 1:500]
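How much precision should drop at a given recall depends on the classifier's false positive rate. The sketch below is a rough illustration of that, assuming a hypothetical operating point (80% recall at a 1% false positive rate) and hypothetical class counts. Neither is measured on this dataset, and the function name is made up for the example.

```python
def precision_at(tpr: float, fpr: float, n_pos: int, n_neg: int) -> float:
    """Precision implied by a fixed (TPR, FPR) operating point and class counts."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return tp / (tp + fp)


# Hypothetical operating point: 80% recall at a 1% false positive rate.
TPR, FPR = 0.80, 0.01
for ratio in (1, 10, 100, 500):
    p = precision_at(TPR, FPR, n_pos=1000, n_neg=1000 * ratio)
    print(f"class ratio 1:{ratio:<3d} precision at recall {TPR:.2f} = {p:.3f}")
```

A classifier with a much lower false positive rate would show a much smaller drop, which is one possible reason for curves that barely move between ratios.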
So the question is why the operator itself reports AUPRC values that do not match these plots, unless of course I am making some serious mistake.
I attach my process which is used for estimating these curves, plus my test labelled dataset as well from which different ratios can be sampled.
Vladimir
http://whatthefraud.wtf
Hey @mschmitz - could you please elaborate regarding my latest plots / messages in this thread?
This issue is still not clear to me.
Vladimir
http://whatthefraud.wtf
@kypexin,
I've done some tests. Attached is my project on your data. For me the AUPRC drops heavily, as expected.
In the results, the left column is the number of negative examples and the right one is the AUPRC. There is also a way to visualize the AUPRC exactly like the operator does it. I think one good question is: how to handle missings in the integral? Since I copied most of the code from AUC, the handling is the same.
BR,
Martin
Dortmund, Germany
@mschmitz -- please look.
If we take each sample size separately (I did it for the value of 100 for example) and then visualize precision against recall, we can get two meaningful (to my understanding) charts:
Precision vs. Recall, as series
Here we see that while recall goes from 0 to 1, all the way precision slowly goes downwards, from 1 to 0.5. Correct?
In a scatter plot, we basically see the same, just from a different perspective.
Now, my question is -- can you please point out what part in this plot exactly counts as an area under curve? If we connect all the points together, we, basically, will get a precision-recall curve, right? So what is the area under it?
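One way to read "area under the curve" here is the region between the connected points and the recall axis. The sketch below is a minimal illustration using the trapezoid rule. It assumes a hypothetical curve in which precision falls linearly from 1 to 0.5, which simplifies the plot described above and is not the actual data.

```python
def area_under_points(recalls, precisions):
    """Trapezoid-rule area under a curve given as points sorted by recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        mean_height = (precisions[i] + precisions[i - 1]) / 2.0
        area += width * mean_height
    return area


# Hypothetical curve: precision falls linearly from 1.0 to 0.5 as recall goes 0 -> 1.
recalls = [i / 10 for i in range(11)]
precisions = [1.0 - 0.5 * r for r in recalls]
print(round(area_under_points(recalls, precisions), 3))
```

Under that assumption the area works out to 0.75, so a curve of this shape and a reported value well below 0.5 would not agree with each other.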
PS same plots for sample size = 500
Sorry, my brain has started to exhaust smokes already )
Vladimir
http://whatthefraud.wtf
Hey @kypexin,
To be honest, I've only adapted our AUC performance measure: I copied all of the code and only changed TPR/FPR to precision/recall. So the Java code is fairly similar to the AUC code.
#1 Generate these points
These are the same as for AUC. That's why we can use Extract ROC Curve.
#2
For each point in rocData:
double fpDivN = point.getFalsePositives() / rocData.getTotalNegatives();
This is Recall and Precision. Then we do the "summation"
and store the last value:
last = new double[] { fpDivN, tpDivP };
That makes a lot of sense for me..?
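The two steps above can be sketched as follows. This is written in Python rather than the operator's Java and is an illustration of the idea, not the Toolbox code. The function name, the handling of the first point and the lack of tie grouping are assumptions made for this example.

```python
def auprc_from_scores(labels, scores, positive=1):
    """Area under the precision-recall curve via an AUC-style trapezoid sum.

    Examples are visited in order of decreasing confidence for the positive
    class; each step adds one (recall, precision) point to the curve.
    Tied scores are not grouped here, which a production version should do.
    """
    total_pos = sum(1 for y in labels if y == positive)
    if total_pos == 0:
        raise ValueError("no positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    area = 0.0
    last = None  # previous (recall, precision) point, mirroring `last` in the AUC loop
    for _, label in ranked:
        if label == positive:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos     # recall of the POSITIVE class
        precision = tp / (tp + fp)  # precision of the POSITIVE class
        if last is None:
            # Assumption: treat the curve as flat from recall 0 to the first point.
            area += recall * precision
        else:
            area += (recall - last[0]) * (precision + last[1]) / 2.0
        last = (recall, precision)
    return area


print(auprc_from_scores([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

Both recall and precision refer to the same (positive) class here; mixing classes between the two would distort the area.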
Cheers,
Martin
Dortmund, Germany
Well @mschmitz, in the case of a ROC curve it is clear what the area under it is. Looking at the visualizations I made for the PRC, it is not really clear, because I literally cannot see where and why AUPRC = 0.35 for sample size 500, and this is the problem here. A curve with an area under it lower than 0.5 would be hanging lower than the diagonal line, wouldn't it?
Vladimir
http://whatthefraud.wtf
Hi,
There was at least one bug. For some crazy reason the recall calculation was for the negative class, while the precision was for the positive class. It's fixed now.
Do you know a good way to check if it's working as expected?
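A few checks that do not depend on the operator can be sketched like this. The sketch uses step-wise average precision, a different estimator from the trapezoid sum, so values may differ slightly from what the operator reports. The function name and the toy data are made up for this example.

```python
def average_precision(labels, scores, positive=1):
    """Step-wise average precision: mean of the precision at each positive hit."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    precisions_at_hits = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            tp += 1
            precisions_at_hits.append(tp / rank)
    if not precisions_at_hits:
        raise ValueError("no positive examples")
    return sum(precisions_at_hits) / len(precisions_at_hits)


# Check 1: a perfect ranking should score 1.0.
perfect = average_precision([1, 1, 0, 0, 0], [0.9, 0.8, 0.3, 0.2, 0.1])

# Check 2: an uninformative ranking should score about the prevalence.
# Every 10th example is positive here, so the prevalence is 0.1.
labels = [1 if (i + 1) % 10 == 0 else 0 for i in range(1000)]
scores = [1000 - i for i in range(1000)]
uninformative = average_precision(labels, scores)

# Check 3: treating the majority class as positive should give a high value.
swapped = average_precision(labels, scores, positive=0)

print(perfect, uninformative, swapped)
```

If the operator gives a value far from 1 for a perfectly separated toy set, or far from the prevalence for shuffled scores, something is still off.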
BR,
Martin
Dortmund, Germany
Did you update the operator itself? I could test it as soon as it is available.
But still, another really important thing to consider in the future is a curve visualization. Because, as we saw, the number itself often does not give much intuition.
Vladimir
http://whatthefraud.wtf
The operator is updated and will be released in the next release of the Toolbox. I've taken the class' recall.
~Martin
Dortmund, Germany
Thanks Martin! truly appreciate your help.
Vladimir
http://whatthefraud.wtf
Can you please advise where I can get hold of the said operator with PRC curve visualisation?
Thanks
Narayan
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts