Logistic regression threshold
Hello all,
I am doing a simple logistic regression exercise (no SVM, just plain logistic regression) and I cannot understand how RapidMiner defines the threshold for classifying instances as "yes". Similar posts mention that it automatically chooses 0.5, but that is not the case here. I downloaded all the "yes" predictions and sorted them in ascending order: the lowest confidence is 0.3108, so that appears to be the threshold. Why?
I am using the "Default" dataset from the ISLR library (https://cran.r-project.org/web/packages/ISLR/index.html).
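In case it is useful, this is roughly how I checked it (a quick Python/pandas sketch; the file name and column names are just placeholders for my exported results):

import pandas as pd

# Load the scored examples exported from RapidMiner (placeholder file name).
preds = pd.read_csv("predictions.csv")

# Among everything predicted "yes", the smallest confidence reveals the
# threshold actually used: here it comes out as 0.3108, not 0.5.
yes_rows = preds[preds["prediction(default)"] == "yes"]
print(yes_rows["confidence(yes)"].min())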
Thanks in advance,
Bernardo
Best Answer
phellinger (Employee-RapidMiner, RM Engineering)
Hi Bernardo,
Logistic Regression also uses 0.5 as the threshold value starting from version 7.6, see https://docs.rapidminer.com/7.6/studio/releases/7.6/changes-7.6.0.html ("Logistic Regression and Generalized Linear Model learners now use 0.5 as the threshold as other binominal learners").
The old behaviour is kept for backward compatibility reasons. You can easily alter the operator's behaviour by increasing its compatibility level. (For whatever reason, it is set to 7.5.000 in your process.) The reason for the old behaviour was that one can optimize for maximal F-measure by choosing a different threshold, but this can be confusing. That's why this alternative threshold is now only provided on a "threshold" output port, and 0.5 is used otherwise.
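To make the old behaviour concrete, here is a small illustration (a sketch in Python/scikit-learn on synthetic data, not the actual H2O internals): picking the threshold that maximizes F-measure on an imbalanced label usually lands well below 0.5, which is consistent with the 0.3108 you observed.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Imbalanced binary problem, roughly like the "default" label.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Scan candidate thresholds and keep the one with the best F-measure.
thresholds = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(y, (proba >= t).astype(int)) for t in thresholds]
print(thresholds[int(np.argmax(f1s))])  # typically well below 0.5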
Best,
Peter
Answers
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve DefaultFull" width="90" x="112" y="187">
<parameter key="repository_entry" value="//Local Repository/data/DefaultFull"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="default"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="124" name="Logistic Regression" width="90" x="380" y="34"/>
<operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="34">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve DefaultFull" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="Logistic Regression" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Logistic Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Hi @bernardo_pagnon
The way you have built the process is wrong: you train a model on the whole data set and then apply the trained model to the same, already labelled data, which can produce unexpected output.
In the simplest case you should split the data before training, so that the model is trained on, say, 80% of the data and then applied to the other 20% of the examples, for example like this:
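Outside of Studio the same idea looks like this (a minimal Python/scikit-learn sketch with synthetic stand-in data; inside RapidMiner you would put a Split Data operator in front of Logistic Regression):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Default data: rare positive class.
X, y = make_classification(n_samples=10000, weights=[0.97], random_state=0)

# Hold out 20% of the examples before any training happens.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # train on 80%
print(model.score(X_te, y_te))  # evaluate on the untouched 20%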
Vladimir
http://whatthefraud.wtf
Dear Peter,
you nailed it! I would never have figured it out by myself. I updated the compatibility level to 8.2 and now 0.5 is the default threshold! Thank you so much!
Best,
Bernardo
Dear Vladimir,
thank you for your reply. I agree that testing the model on the training data is not good practice, but it is not wrong as such. After splitting the data I still observed the same threshold, so that was not the cause.
Best,
Bernardo
Hi @bernardo_pagnon
Okay, my guess about the regression thresholds was not correct, but I am glad @phellinger has provided this nice solution.
I should still warn you about applying the model to the training set, though: it is technically possible, but it does not make much sense, because if you measure performance that way you will just end up with an overfit model that looks perfect, for example:
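Here is the effect in a minimal Python/scikit-learn sketch (synthetic data, with many features relative to examples so the overfitting is easy to see):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Few examples, many features: logistic regression can separate the
# training set perfectly, which says nothing about new data.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print(model.score(X_tr, y_tr))  # close to 1.0 on its own training data
print(model.score(X_te, y_te))  # noticeably lower on held-out data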
Vladimir
http://whatthefraud.wtf