Logistic regression threshold
Hello all,
I am doing a simple logistic regression exercise (no SVM, just plain logistic regression) and I cannot understand how RapidMiner defines the threshold for classifying instances as "yes". Similar posts mention that it automatically chooses 0.5, but that is not the case here. I downloaded all the "yes" predictions and sorted them in ascending order: the lowest confidence is 0.3108, so that appears to be the threshold. Why?
I am using the "Default" dataset from the ISLR library (https://cran.r-project.org/web/packages/ISLR/index.html).
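In case it is useful, this is roughly how I checked it (a quick Python/pandas sketch; the file name and column names are just placeholders for my exported results):

import pandas as pd

# Load the scored examples exported from RapidMiner (placeholder file name).
preds = pd.read_csv("predictions.csv")

# Among everything predicted "yes", the smallest confidence reveals the
# threshold actually used: here it comes out as 0.3108, not 0.5.
yes_rows = preds[preds["prediction(default)"] == "yes"]
print(yes_rows["confidence(yes)"].min())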
Thanks in advance,
Bernardo
Best Answer
phellinger (Employee-RapidMiner, RM Engineering)
Hi Bernardo,
Logistic Regression also uses 0.5 as the threshold value starting from version 7.6, see https://docs.rapidminer.com/7.6/studio/releases/7.6/changes-7.6.0.html ("Logistic Regression and Generalized Linear Model learners now use 0.5 as the threshold as other binominal learners").
The old behaviour is kept for backward compatibility reasons. You can easily alter the operator's behaviour by increasing its compatibility level. (For whatever reason, it is set to 7.5.000 in your process.) The reason for the old behaviour was that one can optimize for maximal F-measure by choosing a different threshold, but this can be confusing. That's why this alternative threshold is now only provided on a "threshold" output port, and 0.5 is used otherwise.
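To make the old behaviour concrete, here is a small illustration (a sketch in Python/scikit-learn on synthetic data, not the actual H2O internals): picking the threshold that maximizes F-measure on an imbalanced label usually lands well below 0.5, which is consistent with the 0.3108 you observed.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Imbalanced binary problem, roughly like the "default" label.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Scan candidate thresholds and keep the one with the best F-measure.
thresholds = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(y, (proba >= t).astype(int)) for t in thresholds]
print(thresholds[int(np.argmax(f1s))])  # typically well below 0.5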
Best,
Peter
Answers
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve DefaultFull" width="90" x="112" y="187">
<parameter key="repository_entry" value="//Local Repository/data/DefaultFull"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="default"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="124" name="Logistic Regression" width="90" x="380" y="34"/>
<operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="34">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve DefaultFull" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="Logistic Regression" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Logistic Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Hi @bernardo_pagnon
The way you have built the process is wrong: you train a model on the whole data set and then apply the trained model to the same, already labelled data, which can produce unexpected output.
In the simplest case you should split the data before training, so that the model is trained on, say, 80% of the data and then applied to the other 20% of the examples, for example like this:
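Outside of Studio the same idea looks like this (a minimal Python/scikit-learn sketch with synthetic stand-in data; inside RapidMiner you would put a Split Data operator in front of Logistic Regression):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Default data: rare positive class.
X, y = make_classification(n_samples=10000, weights=[0.97], random_state=0)

# Hold out 20% of the examples before any training happens.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # train on 80%
print(model.score(X_te, y_te))  # evaluate on the untouched 20%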
Vladimir
http://whatthefraud.wtf
Dear Peter,
you nailed it! I would never have figured it out by myself. I updated the compatibility level to 8.2 and now 0.5 is the default threshold! Thank you so much!
Best,
Bernardo
Dear Vladimir,
thank you for your reply. I agree that testing the model on the training data is not good practice, but it is not wrong as such. After splitting the data I still observed the same threshold, so that was not the cause.
Best,
Bernardo
Hi @bernardo_pagnon
Okay, my guess about the regression thresholds was not correct, but I am glad @phellinger has provided this nice solution.
I should still warn you about applying the model to the training set, though: it is technically possible, but it does not make much sense, because if you measure performance that way you will just end up with an overfit model that looks perfect, for example:
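Here is the effect in a minimal Python/scikit-learn sketch (synthetic data, with many features relative to examples so the overfitting is easy to see):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Few examples, many features: logistic regression can separate the
# training set perfectly, which says nothing about new data.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print(model.score(X_tr, y_tr))  # close to 1.0 on its own training data
print(model.score(X_te, y_te))  # noticeably lower on held-out data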
Vladimir
http://whatthefraud.wtf