Naive Bayes - Execute Python vs RM: different AUC
Hi,
I am continuing my experiments on RM / Execute Python with the NB model.
Sorry, but I feel obliged to call on you again, mschmitz, this time with numerical examples for both the RM model and Execute Python.
Indeed, I retrieve strictly the same scoring results in both models (accuracy, weighted mean recall, weighted mean precision, recall (positive class no/yes), precision (positive class no/yes)) except... for the AUC:
AUC(RM)= 0.942
AUC(Python) = 0.883
I suppose that the AUC is calculated from the ROC curve.
But how is it calculated? And how can this difference be explained?
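From what I understand, the usual definition is the trapezoidal area under the ROC curve, with one (FPR, TPR) point per distinct confidence value used as a threshold. Here is a minimal sketch with hypothetical numbers (this follows scikit-learn's definition; RM's internal method may differ):

# Minimal sketch: AUC as the trapezoidal area under the ROC curve.
# The labels and confidences below are hypothetical, purely for illustration.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_conf = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

# One (FPR, TPR) point per distinct confidence used as a threshold
fpr, tpr, thresholds = roc_curve(y_true, y_conf)

print(auc(fpr, tpr))       # trapezoidal integration of the ROC points
print(np.trapz(tpr, fpr))  # the same area computed directly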
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (3)" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="85"/>
<operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="514" y="85"/>
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals (2)" width="90" x="45" y="340">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="coding_type" value="unique integers"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="166" name="Build / Apply model" width="90" x="514" y="289">
<parameter key="script" value="import pandas as pd import numpy as np from sklearn.naive_bayes import GaussianNB from sklearn.calibration import CalibratedClassifierCV from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score from sklearn.metrics import recall_score from sklearn.metrics import precision_score from sklearn.metrics import roc_auc_score from sklearn import metrics # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(data): # Build the model X = data.iloc[:,1:] y = data.iloc[:,0] NB = GaussianNB() NB.fit(X,y) NB_Calib = CalibratedClassifierCV(base_estimator = NB,method = 'sigmoid') NB_Calib.fit(X,y) #Calculate probability of each class. pr = NB.class_prior_ #Calculate mean of each feature per class th= NB.theta_ #Apply the model y_pred = NB.predict(X) y_prob = NB_Calib.predict_proba(X) # Calculate the scoring #confusion matrix conf_matrix = confusion_matrix(y,y_pred) #accuracy acc_score = 100*accuracy_score(y,y_pred) #weighted recall reca_score = 100*recall_score(y,y_pred,average = 'weighted') #weighted precision precisionscore = 100*precision_score(y,y_pred,average='weighted') #recall (positive class : yes / positive class : no ) reca_no = 100*recall_score(y,y_pred,average =None) #precision (positive class : yes / positive class : no ) precision_no = 100*precision_score(y,y_pred,average=None) #AUC (positive class : no) AUCscore = roc_auc_score(y,y_pred,average=None) #AUC (positive class : no) méthode n°2 fpr, tpr, thresholds = metrics.roc_curve(y, y_pred, pos_label=1) AUC_2 = metrics.auc(fpr, tpr) #Write the y_pred and scores in dataframe y_prediction = pd.DataFrame(data = y_pred,columns = ['prediction(Future Customer)']) y_probability = pd.DataFrame(data = y_prob,columns = ['confidence(yes)','confidence(no)']) data = data.join(y_prediction) data = data.join(y_probability) accu_score = pd.DataFrame(data = [acc_score],columns = ['accuracy']) recall_weighted = pd.DataFrame(data = [reca_score],columns = ['weighted_mean_recall']) precision_weighted = pd.DataFrame(data = [precisionscore],columns = ['weighted_mean_precision']) recall_no = pd.DataFrame(data = [reca_no],columns = ['recall (positive class : yes)','recall (positive class : no)']) precision_no = pd.DataFrame(data = [precision_no],columns = ['precision (positive class : yes)','precision (positive class : no)']) AUC = pd.DataFrame(data = [AUCscore],columns = ['AUC']) AUC2 = pd.DataFrame(data = [AUC_2],columns = ['AUC_method2']) score = accu_score.join(recall_weighted) score = score.join(precision_weighted) score = score.join(recall_no) score = score.join(precision_no) score = score.join(AUC) score = score.join(AUC2) theta = pd.DataFrame(data = th,columns = ['Gender = Male','Gender = Female','PM = Credit card','PM = cheque','PM = cash','Age']) proba = pd.DataFrame(data = pr, columns = ['probability']) confus_matrix = pd.DataFrame(data = conf_matrix,columns = ['true yes','true no']) 	 #data.rm_metadata['prediction(Future Customer)']=(None,'prediction(Future Customer)') # connect 4 output ports to see the results return score,theta, confus_matrix,proba,data"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="648" y="85">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (2)" width="90" x="782" y="85"/>
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance (2)" width="90" x="916" y="136"/>
<operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="916" y="34">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Nominal to Numerical (3)" to_port="example set input"/>
<connect from_op="Nominal to Numerical (3)" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Deals (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
<connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Build / Apply model" to_port="input 1"/>
<connect from_op="Build / Apply model" from_port="output 1" to_port="result 1"/>
<connect from_op="Build / Apply model" from_port="output 2" to_port="result 2"/>
<connect from_op="Build / Apply model" from_port="output 3" to_port="result 3"/>
<connect from_op="Build / Apply model" from_port="output 4" to_port="result 4"/>
<connect from_op="Build / Apply model" from_port="output 5" to_port="result 7"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="result 8"/>
<connect from_op="Performance" from_port="performance" to_port="result 5"/>
<connect from_op="Performance" from_port="example set" to_port="result 6"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
<portSpacing port="sink_result 9" spacing="0"/>
</process>
</operator>
</process>
Thank you,
Best regards,
Lionel
Best Answer
MartinLiebig
Hi @lionelderkrikor,
I think one of the main differences is this line:
NB_Calib = CalibratedClassifierCV(base_estimator = NB,method = 'sigmoid')
I am not sure exactly what it does, but it changes the confidences. RM is not doing that in its X-Val, so it is expected that you get different results.
A fairer comparison of the AUC itself would be to take the example set that was scored in RM and calculate the AUC on it in both Python and RM. There are always slight differences in how AUC is calculated, but your difference is a bit too large for that alone; I would expect the line above to influence the difference more.
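To illustrate the point, here is a minimal self-contained sketch on synthetic data (using the older base_estimator argument, as in your script; this is not RM's internal code):

# Sketch: the sigmoid calibration refits the probabilities, so the
# confidences (and any AUC computed from them) can differ from raw NB.
# Synthetic data, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)

nb = GaussianNB().fit(X, y)
nb_calib = CalibratedClassifierCV(base_estimator=GaussianNB(),
                                  method='sigmoid').fit(X, y)

# The cross-validated sigmoid fit can change the ranking of the examples,
# so the two AUCs need not agree.
print(roc_auc_score(y, nb.predict_proba(X)[:, 1]))
print(roc_auc_score(y, nb_calib.predict_proba(X)[:, 1]))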
Cheers,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
It likely has to do with the way ties are handled: there are multiple options for that when calculating ROC/AUC, and not all software uses the same method. You'll either have to dive into the details of the ROC/AUC calculations in Python vs. RapidMiner (via the Java code on GitHub), or maybe one of the developers will chime in because they already know the answer :-)
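As a quick feel for the tie issue, here is a small sketch (hypothetical data; scikit-learn's roc_curve collapses tied scores into a single threshold, and other tools may handle ties differently):

# Sketch: tied scores collapse into a single ROC threshold in scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 1, 0, 1, 1, 0])

# Distinct confidences: several thresholds, several ROC points
_, _, thr = roc_curve(y_true, [0.1, 0.9, 0.6, 0.8, 0.4, 0.3])
print(thr)

# Heavily tied scores (the extreme case: hard 0/1 predictions):
# only the two distinct values remain as thresholds
_, _, thr = roc_curve(y_true, [0, 1, 0, 1, 1, 0])
print(thr)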
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @mschmitz,
Here are two elements:
1. Probability calibration:
Recently, during my experiments comparing Python and RM, I also became interested in probability calibration.
Indeed, at first the confidences calculated by the model (SVM) in Python were aberrant (for the predicted class, the confidence was < 0.5 in a binary problem!). After investigating, I discovered this Python class, which seems to improve the relevance of classifier confidences. So I built an SVM model (strictly the same in both Python and RM) and used this class to calculate the new confidences in Python: they were different from RM's.
To go further :
http://scikit-learn.org/stable/modules/calibration.html
To confirm, with the NB model in the following process, I also applied the class above with Execute Python: the confidences from "Execute Python" are indeed different from the confidences of RM (the training example set Chapter09DataSet_Training.csv is in the attached file).
2. The ROC curve:
In parallel, I built the ROC curve with Python, and it's weird:
Python uses only one point to create the ROC. Here is a screenshot of this ROC:
[Screenshot: NB ROC Python curve]
While RM uses many more points:
[Screenshot: NB_ROC_RM curve]
The number of points taken into account is not the same in the two cases. RM is more precise than Python, so the two curves do not have the same "shape", and therefore the area under the curve is different. To me, there is a "bug", or at least a simplification / lack of precision, in Python.
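For what it's worth, one possible explanation for the single point, judging from the script above: roc_curve is called there with the hard predictions y_pred, which contain only two distinct values, whereas RM plots the ROC from the continuous confidences. A self-contained sketch on synthetic data (not the Deals data):

# Sketch: hard 0/1 predictions give a one-point ROC, while the class
# probabilities give the full multi-point curve. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

X, y = make_classification(n_samples=200, random_state=0)
nb = GaussianNB().fit(X, y)

y_pred = nb.predict(X)               # hard 0/1 labels
y_score = nb.predict_proba(X)[:, 1]  # continuous confidences

fpr_h, tpr_h, thr_h = metrics.roc_curve(y, y_pred, pos_label=1)
fpr_s, tpr_s, thr_s = metrics.roc_curve(y, y_score, pos_label=1)

print(len(thr_h), metrics.auc(fpr_h, tpr_h))  # few thresholds: one interior ROC point
print(len(thr_s), metrics.auc(fpr_s, tpr_s))  # many thresholds: full curve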
Best regards,
Lionel