
Naive Bayes - Execute Python vs RM : same model / different scoring results

lionelderkrikorlionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
edited June 2019 in Help

Hi,

 

I'm running some experiments in RM: I compare the results of the "Naive Bayes" operator

with those obtained from the "Execute Python" operator, using the "Deals" dataset.

In "Execute Python", the model is built, applied, and scored using scikit-learn.

For RM, I use the "Naive Bayes" operator inside the "Cross Validation" operator.

 

After executing the process, something is weird: 

 - In both cases I get strictly the same "Distribution Table" (so I think the built model is the same in both cases),

but 

 - the confusion matrix, the mean accuracy, the weighted mean recall and the weighted mean precision are systematically different: the confusion matrices differ, and the performances of RM (~92%) are greater than those of Execute Python (~88%) on the same dataset.

 

Here you can find my process : 

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="179" y="34"/>
<connect from_port="training set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals (2)" width="90" x="45" y="289">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="238">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="238">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="coding_type" value="unique integers"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="145" name="Build / Apply model" width="90" x="514" y="238">
<parameter key="script" value="import pandas as pd&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.model_selection import cross_val_score&#10;from sklearn.model_selection import train_test_split&#10;from sklearn.metrics import confusion_matrix&#10;from sklearn.metrics import accuracy_score&#10;from sklearn.metrics import recall_score&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; # Build the model&#10; X = data.iloc[:,1:]&#10; y = data.iloc[:,0]&#10; NB = GaussianNB()&#10; NB.fit(X,y)&#10;&#10; #Calculate probability of each class.&#10;&#10; pr = NB.class_prior_ &#10; &#10; #Calculate mean of each feature per class&#10; th= NB.theta_&#10;&#10; #Apply the model&#10; y_pred = NB.predict(X)&#10; &#10; &#10; # Calculate the scoring&#10; conf_matrix = confusion_matrix(y,y_pred)&#10; &#10; acc_score_mean = (cross_val_score(NB, X, y,cv = 10, scoring = 'accuracy' )).mean()&#10; acc_score_std = (cross_val_score(NB, X, y,cv = 10, scoring = 'accuracy' )).std()&#10; acc_score = str(100* acc_score_mean) + &quot; +/- &quot; + str( 100* acc_score_std) &#10; &#10; reca_score_mean = (cross_val_score(NB, X, y,cv = 10, scoring = 'recall_weighted' )).mean()&#10; reca_score_std = (cross_val_score(NB, X, y,cv = 10, scoring = 'recall_weighted' )).std()&#10; reca_score = str(100* reca_score_mean) + &quot; +/- &quot; + str( 100* acc_score_std) &#10; &#10; precision_score_mean = (cross_val_score(NB, X, y,cv = 10, scoring = 'precision_weighted' )).mean()&#10; precision_score_std = (cross_val_score(NB, X, y,cv = 10, scoring = 'precision_weighted' )).std()&#10; precision_score = str(100* precision_score_mean) + &quot; +/- &quot; + str( 100* precision_score_std ) &#10; &#10; #Write the scores in dataframe&#10; accu_score = pd.DataFrame(data = [acc_score],columns = ['accuracy'])&#10; recall_weighted = pd.DataFrame(data = [reca_score],columns = 
['weighted_mean_recall']) &#10; precision_weighted = pd.DataFrame(data = [precision_score],columns = ['weighted_mean_precision']) &#10; score = accu_score.join(recall_weighted)&#10; score = score.join(precision_weighted)&#10; &#10; theta = pd.DataFrame(data = th,columns = ['Gender = Male','Gender = Female','PM = Credit card','PM = cheque','PM = cash','Age'])&#10; proba = pd.DataFrame(data = pr, columns = ['probability'])&#10; &#10; &#10; confus_matrix = pd.DataFrame(data = conf_matrix,columns = ['true yes','true no']) &#9;&#10; &#10;&#10; # connect 4 output ports to see the results&#10; return score,theta, confus_matrix,proba"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 5"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 3"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 6"/>
<connect from_op="Retrieve Deals (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
<connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Build / Apply model" to_port="input 1"/>
<connect from_op="Build / Apply model" from_port="output 1" to_port="result 1"/>
<connect from_op="Build / Apply model" from_port="output 2" to_port="result 2"/>
<connect from_op="Build / Apply model" from_port="output 3" to_port="result 4"/>
<connect from_op="Build / Apply model" from_port="output 4" to_port="result 7"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
</process>
</operator>
</process>

NB: I think there is a bug in scikit-learn: the mean accuracy and the weighted mean recall are strictly and systematically equal (tested on other datasets too).
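(Side note: this equality is expected rather than a bug. The support-weighted mean recall reduces to accuracy, since sum_i (n_i/N) * (TP_i/n_i) = sum_i TP_i / N. A quick check with synthetic labels, purely illustrative:)

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels, just to illustrate the identity.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)

acc = accuracy_score(y_true, y_pred)
rec_w = recall_score(y_true, y_pred, average='weighted')

# weighted recall = sum_i (n_i/N) * (TP_i/n_i) = sum_i TP_i / N = accuracy
```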

 

How can we explain these mysterious results?

 

Thank you for your explanation,

 

Regards,

 

Lionel

 

 

 

Best Answer

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Solution Accepted

    Hi Lionel,

     

    you need to compare apples with apples. You hand a fully numericalized example set over to Python, and the Python NB assumes a Gaussian distribution for all attributes.

    In RM you hand over partly numerical, partly nominal data. RM also assumes Gaussian data for the numerical parts, but for the nominal parts we get the probability from the class ratios (20% of the data are female => p = 0.2).

     

    If you use all-numerical data everywhere, you get the same results (see attached process). I think sklearn is not able to handle nominals the same correct way we do.
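One way to mimic that mixed treatment in scikit-learn is to fit a GaussianNB on the numeric columns and a CategoricalNB on the nominal ones, then combine their per-class log-posteriors. This is only a sketch with hypothetical toy data, and CategoricalNB applies Laplace smoothing by default, so its ratios differ slightly from the plain frequency estimates RM uses:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, CategoricalNB

# Hypothetical toy data: one numeric attribute, one nominal attribute
# already encoded as category indices (0, 1, 2).
X_num = np.array([[25.0], [32.0], [47.0], [51.0], [29.0], [60.0]])
X_cat = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([0, 0, 1, 1, 0, 1])

gnb = GaussianNB().fit(X_num, y)     # Gaussian model for numeric parts
cnb = CategoricalNB().fit(X_cat, y)  # frequency-based model for nominals

# Naive Bayes factorizes over attributes, so we can add the two models'
# log-posteriors and subtract one copy of the log prior (each sub-model
# already includes the prior once). The argmax is unaffected by the
# per-row normalization constants.
combined = (gnb.predict_log_proba(X_num)
            + cnb.predict_log_proba(X_cat)
            - np.log(gnb.class_prior_))
pred = gnb.classes_[combined.argmax(axis=1)]
```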

     

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals (2)" width="90" x="45" y="238">
    <parameter key="repository_entry" value="//Samples/data/Deals"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="238">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Future Customer"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (3)" width="90" x="313" y="238"/>
    <operator activated="true" breakpoints="after" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="514" y="289">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Future Customer"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="coding_type" value="unique integers"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="514" y="34"/>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="145" name="Build / Apply model" width="90" x="715" y="289">
    <parameter key="script" value="import pandas as pd&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.metrics import confusion_matrix&#10;from sklearn.metrics import accuracy_score&#10;from sklearn.metrics import recall_score&#10;from sklearn.metrics import precision_score&#10;&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; # Build the model&#10; X = data.iloc[:,1:]&#10; y = data.iloc[:,0]&#10; NB = GaussianNB()&#10; NB.fit(X,y)&#10;&#10; #Calculate probability of each class.&#10;&#10; pr = NB.class_prior_ &#10; &#10; #Calculate mean of each feature per class&#10; th= NB.theta_&#10;&#10; #Apply the model&#10; y_pred = NB.predict(X)&#10; &#10; &#10; # Calculate the scoring&#10; &#10; #confusion matrix&#10; conf_matrix = confusion_matrix(y,y_pred)&#10; &#10; #accuracy&#10; acc_score = 100*accuracy_score(y,y_pred) &#10; &#10; #recall&#10; reca_score = 100*recall_score(y,y_pred,average='weighted') &#10; &#10; #precision&#10; precisionscore = 100*precision_score(y,y_pred,average='weighted') &#10; &#10; #Write the scores in dataframe&#10; accu_score = pd.DataFrame(data = [acc_score],columns = ['accuracy'])&#10; recall_weighted = pd.DataFrame(data = [reca_score],columns = ['weighted_mean_recall']) &#10; precision_weighted = pd.DataFrame(data = [precisionscore],columns = ['weighted_mean_precision']) &#10; score = accu_score.join(recall_weighted)&#10; score = score.join(precision_weighted)&#10; &#10; theta = pd.DataFrame(data = th,columns = ['Gender = Male','Gender = Female','PM = Credit card','PM = cheque','PM = cash','Age'])&#10; proba = pd.DataFrame(data = pr, columns = ['probability'])&#10; &#10; confus_matrix = pd.DataFrame(data = conf_matrix,columns = ['true yes','true no']) &#9;&#10; &#10;&#10; # connect 4 output ports to see the results&#10; return score,theta, confus_matrix,proba"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="849" y="85">
    <parameter key="weighted_mean_recall" value="true"/>
    <parameter key="weighted_mean_precision" value="true"/>
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Deals (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Multiply (3)" to_port="input"/>
    <connect from_op="Multiply (3)" from_port="output 1" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Multiply (3)" from_port="output 2" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Build / Apply model" to_port="input 1"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Naive Bayes" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Build / Apply model" from_port="output 1" to_port="result 1"/>
    <connect from_op="Build / Apply model" from_port="output 2" to_port="result 2"/>
    <connect from_op="Build / Apply model" from_port="output 3" to_port="result 3"/>
    <connect from_op="Build / Apply model" from_port="output 4" to_port="result 4"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 6"/>
    <connect from_op="Performance" from_port="performance" to_port="result 5"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    </process>
    </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • Telcontar120Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Another one of your interesting comparisons! :smileyhappy:

    Did you also perform cross-validation in Python (you didn't mention it)? If not, then you aren't doing exactly the same thing. Cross-validation results vary because of the random sampling used to build the folds.

    A cleaner comparison would be to build the NB model on the full dataset in both tools without cross-validation, or with a specific holdout split validation sample (the same one in both, not using random sampling). Then compare those results. 

    A further note on the effects of randomization: unless you are using the "local random seed" parameter, even in RapidMiner your results may vary across different runs of the same process. In RapidMiner you use that parameter to ensure reproducibility over time. I suspect Python has some kind of similar setting but I am not sure how it works.
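For reference, scikit-learn's counterpart is the `random_state` parameter on the fold splitter; fixing it yields identical folds, and hence identical scores, on every run (sketch on synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; the actual dataset does not matter here.
X, y = make_classification(n_samples=200, random_state=42)

# random_state on the splitter plays the same role as RapidMiner's
# "use local random seed": the shuffle is reproducible.
cv = KFold(n_splits=10, shuffle=True, random_state=1992)
s1 = cross_val_score(GaussianNB(), X, y, cv=cv)
s2 = cross_val_score(GaussianNB(), X, y, cv=cv)
```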

     

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • lionelderkrikorlionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi,

    Yes, I am continuing my learning of data science, RM and Python, and thus indeed a new comparison :smileyvery-happy:

     

    1. "Did you also perform cross-validation in Python (you didn't mention it)?"

    Yes, I perform a 10-fold cross-validation in Python (and the same in RM).

     

    2."A cleaner comparison would be to build the NB model on the full dataset in both without cross-validation"

    Indeed, you're right, I have to make the comparison strict: 

    I trained the NB model on the full dataset in both cases and then applied this model to the same full dataset. I observe the same behaviour: the two models are strictly the same (same "Distribution Table"), yet the confusion matrices are different and the performance of the RM model (~92.5%) is greater than that of "Execute Python" (~88.5%).

    Here is the process: 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Deals"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="34"/>
    <operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="313" y="34"/>
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals (2)" width="90" x="45" y="238">
    <parameter key="repository_entry" value="//Samples/data/Deals"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="238">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Future Customer"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="238">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Future Customer"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="coding_type" value="unique integers"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="145" name="Build / Apply model" width="90" x="514" y="238">
    <parameter key="script" value="import pandas as pd&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.metrics import confusion_matrix&#10;from sklearn.metrics import accuracy_score&#10;from sklearn.metrics import recall_score&#10;from sklearn.metrics import precision_score&#10;&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; # Build the model&#10; X = data.iloc[:,1:]&#10; y = data.iloc[:,0]&#10; NB = GaussianNB()&#10; NB.fit(X,y)&#10;&#10; #Calculate probability of each class.&#10;&#10; pr = NB.class_prior_ &#10; &#10; #Calculate mean of each feature per class&#10; th= NB.theta_&#10;&#10; #Apply the model&#10; y_pred = NB.predict(X)&#10; &#10; &#10; # Calculate the scoring&#10; &#10; #confusion matrix&#10; conf_matrix = confusion_matrix(y,y_pred)&#10; &#10; #accuracy&#10; acc_score = 100*accuracy_score(y,y_pred) &#10; &#10; #recall&#10; reca_score = 100*recall_score(y,y_pred,average='weighted') &#10; &#10; #precision&#10; precisionscore = 100*precision_score(y,y_pred,average='weighted') &#10; &#10; #Write the scores in dataframe&#10; accu_score = pd.DataFrame(data = [acc_score],columns = ['accuracy'])&#10; recall_weighted = pd.DataFrame(data = [reca_score],columns = ['weighted_mean_recall']) &#10; precision_weighted = pd.DataFrame(data = [precisionscore],columns = ['weighted_mean_precision']) &#10; score = accu_score.join(recall_weighted)&#10; score = score.join(precision_weighted)&#10; &#10; theta = pd.DataFrame(data = th,columns = ['Gender = Male','Gender = Female','PM = Credit card','PM = cheque','PM = cash','Age'])&#10; proba = pd.DataFrame(data = pr, columns = ['probability'])&#10; &#10; confus_matrix = pd.DataFrame(data = conf_matrix,columns = ['true yes','true no']) &#9;&#10; &#10;&#10; # connect 4 output ports to see the results&#10; return score,theta, confus_matrix,proba"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="581" y="85">
    <parameter key="weighted_mean_recall" value="true"/>
    <parameter key="weighted_mean_precision" value="true"/>
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Deals" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Retrieve Deals (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Build / Apply model" to_port="input 1"/>
    <connect from_op="Build / Apply model" from_port="output 1" to_port="result 1"/>
    <connect from_op="Build / Apply model" from_port="output 2" to_port="result 2"/>
    <connect from_op="Build / Apply model" from_port="output 3" to_port="result 3"/>
    <connect from_op="Build / Apply model" from_port="output 4" to_port="result 4"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 6"/>
    <connect from_op="Performance" from_port="performance" to_port="result 5"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    </process>
    </operator>
    </process>

    3."or with a specific holdout split validation sample"

    I performed a split validation with a training ratio of 0.7 in both cases, and I observe the following strange behaviour: 

    The model trained by RM is always strictly the same, whatever the training/test ratio (same distribution table):

    it seems that the RM model is always trained on the whole dataset, whatever the training/test ratio. ==> performance ~92.5%

     

    With Execute Python, my model is different from case 2 (which seems logical). ==> performance ~90%

    I don't know if we can compare both models in this case. However, we see that there is still a difference between the confusion matrices / performances of the two models.

    NB: I use a random seed in both RM and Python for the split validation.

    Here is the process: 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Deals"/>
    </operator>
    <operator activated="true" class="split_validation" compatibility="8.0.001" expanded="true" height="124" name="Validation" width="90" x="447" y="136">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="179" y="34"/>
    <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
    <parameter key="weighted_mean_recall" value="true"/>
    <parameter key="weighted_mean_precision" value="true"/>
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals (2)" width="90" x="45" y="493">
    <parameter key="repository_entry" value="//Samples/data/Deals"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="246" y="493">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Future Customer"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="493">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Future Customer"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="coding_type" value="unique integers"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="124" name="Execute Python" width="90" x="514" y="493">
    <parameter key="script" value="import pandas as pd&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.model_selection import train_test_split&#10;from sklearn.metrics import confusion_matrix&#10;from sklearn.metrics import accuracy_score&#10;from sklearn.metrics import recall_score&#10;from sklearn.metrics import precision_score&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;# Building of the model&#10; X = data.iloc[:,1:]&#10; y = data.iloc[:,0]&#10;&#10; #Spliting the dataset in training/test sets&#10; X_train,X_test,y_train,y_test = train_test_split( X, y,test_size = 0.3, random_state=1992)&#10; &#10; NB = GaussianNB()&#10; &#10;&#10; NB.fit(X_train,y_train)&#10;&#10; th= NB.theta_&#10; &#10; #Applying the model&#10; y_pred = NB.predict(X_test)&#10;&#10; &#10; #Calculation of performances&#10; &#10;#confusion matrix&#10; conf_matrix = confusion_matrix(y_test,y_pred)&#10;&#10; #accuracy&#10; acc_score = 100*accuracy_score(y_test,y_pred)&#10;&#10; #recall&#10; reca = 100*recall_score(y_test, y_pred, average='weighted') &#10; &#10; #precision&#10; prec = 100*precision_score(y_test, y_pred, average='weighted') &#10; &#10; # Writing of performances&#10; confu_matrix = pd.DataFrame(data = conf_matrix,columns = ['true yes','true no'])&#10; score = pd.DataFrame(data = [acc_score],columns = ['weighted mean accuracy'])&#10; recall_weighted = pd.DataFrame(data = [reca],columns = ['weighted mean recall']) &#10; precision_weighted = pd.DataFrame(data = [prec],columns = ['weighted mean precision']) &#10; score = score.join(recall_weighted)&#10; score = score.join(precision_weighted)&#10; &#10; theta = pd.DataFrame(data = th,columns = ['Gender = Male','Gender = Female','PM = Credit card','PM = cheque','PM = cash','Age'])&#10;&#9;&#10; &#10;&#10; # connect 3 output ports to see the results&#10; return score,confu_matrix,theta"/>
    </operator>
    <connect from_op="Retrieve Deals" from_port="output" to_op="Validation" to_port="training"/>
    <connect from_op="Validation" from_port="model" to_port="result 5"/>
    <connect from_op="Validation" from_port="training" to_port="result 1"/>
    <connect from_op="Validation" from_port="averagable 1" to_port="result 4"/>
    <connect from_op="Retrieve Deals (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 2"/>
    <connect from_op="Execute Python" from_port="output 2" to_port="result 3"/>
    <connect from_op="Execute Python" from_port="output 3" to_port="result 6"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    </process>
    </operator>
    </process>

    Can you help me understand these mysterious results and behaviours?

     

    Best regards,

     

    Lionel

     

     

     

     

     

     

     

     

     

     

  • lionelderkrikorlionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi Martin,

     

    First, thank you for taking the time to perform this analysis and to deliver these explanations.

    You're right: I have to compare what is comparable. In fact, I use numericalized examples for Python because sklearn, indeed, is not able to handle nominals (in that case an error is raised and the process fails). I did not think that using nominal examples instead of numerical examples would improve the performance of the built model.
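That failure mode, and the usual encoding workaround (what the Nominal to Numerical operators do in the processes above), can be sketched like this with hypothetical toy data; OneHotEncoder is only one of several possible encodings:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal attribute kept as raw strings.
X_str = np.array([['male'], ['female'], ['female']], dtype=object)
y = np.array([0, 1, 1])

# GaussianNB expects numeric input, so string attributes raise ValueError.
try:
    GaussianNB().fit(X_str, y)
    failed = False
except ValueError:
    failed = True

# Workaround: encode the nominal attribute numerically first.
X_enc = OneHotEncoder().fit_transform(X_str).toarray()
model = GaussianNB().fit(X_enc, y)
```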

    Concerning the split validation (although I know that cross-validation is to be favored...), why is the built model the same (strictly the same distribution table) whatever the training/test split ratio?

     

    Thank you,

     

    Best regards,

     

    Lionel

     

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    pretty simple. Validations (Split, Bootstrap and X-Val) are only a tool to measure the true performance of the method. The model returned at the model port is always the model trained on the full dataset. This should be better than any of the models built inside the validation folds.
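The scikit-learn analogue of that pattern, sketched on synthetic stand-in data: cross-validation gives the performance estimate, while the deliverable model is refit on everything:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, purely illustrative.
X, y = make_classification(n_samples=200, random_state=0)

# Cross-validation only estimates how the method generalizes...
est = cross_val_score(GaussianNB(), X, y, cv=10).mean()

# ...the model you actually keep is refit on the full dataset, which is
# what RapidMiner's validation operators return at the model port.
final_model = GaussianNB().fit(X, y)
```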

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • lionelderkrikorlionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi, 

     

    OK, that confirms the hypothesis I made.

     

    Thank you,

     

    Best regards,

     

    Lionel
