"Performance Score different than actual prediction"
I am creating a model to predict customer churn ('Y', 'N'). Using Gradient Boosted Trees, the cross-validation reports that the model correctly identifies customers with churn status 'Y' 87% of the time and customers with churn status 'N' 68% of the time. The problem is that when I add unlabeled data and run predictions on it, the actual predictions are not accurate at all. If the model says it can identify a churned customer with 87% accuracy, shouldn't the actual predictions also be right about 87% of the time? Is there a way for me to choose which class I want the prediction to focus on for the unlabeled data, so that I get results closer to the cross-validation score? (In this case churn = 'Y'.)
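For reference, this is how I read those two numbers (a small Python sketch with made-up confusion counts, not output from my actual process):

# The 87% / 68% figures are per-class recalls from the confusion matrix.
# The accuracy and precision you see on new data also depend on how many
# 'Y' and 'N' customers that data actually contains.
tp_y = 87   # actual 'Y' predicted 'Y'  -> recall 'Y' = 87 / (87 + 13)
fn_y = 13   # actual 'Y' predicted 'N'
tn_n = 68   # actual 'N' predicted 'N'  -> recall 'N' = 68 / (68 + 32)
fp_n = 32   # actual 'N' predicted 'Y'

recall_y = tp_y / (tp_y + fn_y)
recall_n = tn_n / (tn_n + fp_n)
accuracy = (tp_y + tn_n) / (tp_y + fn_y + tn_n + fp_n)
precision_y = tp_y / (tp_y + fp_n)  # of the customers predicted 'Y', how many really churn

print(f"recall 'Y'    = {recall_y:.3f}")     # 0.870
print(f"recall 'N'    = {recall_n:.3f}")     # 0.680
print(f"accuracy      = {accuracy:.3f}")     # 0.775 with this 50/50 class mix
print(f"precision 'Y' = {precision_y:.3f}")  # 0.731 -> only ~73% of 'Y' predictions are correct

The full XML of my RapidMiner process is below.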
<?xml version="1.0" encoding="UTF-8"?>
<process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve churnmodel1" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/churnmodel1"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
        <process expanded="true">
          <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="8.2.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="112" y="34">
            <list key="expert_parameters"/>
          </operator>
          <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
          <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve churnmodel1" from_port="output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <background height="232" location="//Samples/Tutorials/Modeling, Scoring, and Validation/04/tutorial4" width="1502" x="26" y="47"/>
    </process>
  </operator>
</process>
Answers
Hi @rafael_s_moceli,
If I understand correctly, you are in fact using a labeled dataset (one that you did not use to train your model) to test and evaluate it, so that you can obtain "actual" predictions and an associated "actual" performance. Is that right?
"Usually" we consider that the performance given by the cross-validation is "representative" of the future performance of the model on "unseen data". However there are cases where it is not true. For example when your model is overfitting. In this case, your model has a (relativ) good performance on the training set but a bad performance on the test set : your model do not generalize enough.
In any case, to give more specific answers for your particular case, could you share your dataset(s) and model so that we can reproduce what you are getting?
Regards,
Lionel
Thanks for the reply.
My data set is a lot bigger than the one I am posting, but I couldn't post the full version because of its size.
I am using a Gradient Boosted Trees model, and I pasted my XML process in the first post.
I am using the churn column with the label role. There is another column (valor_desconto) whose data type I have to change to polynominal.
I wonder whether the fact that I am categorizing a lot of data into groups and converting it to polynominal is having a negative effect on the predictions? I actually did this because I read it would be beneficial.
Also, I made the proportions of churn 'Y' and 'N' equal in this model, but when I make the proportion of churn 'Y' larger, the prediction for 'Y' gets better while the overall performance gets worse.
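To show what I mean by changing the proportions, here is a rough sketch of the effect (in Python with scikit-learn and synthetic data, not my actual RapidMiner process):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced synthetic data: class 1 plays the role of churn = 'Y'
X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def fit_and_report(X_train, y_train, name):
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    pred = model.predict(X_te)
    print(name,
          "recall 'Y':", round(recall_score(y_te, pred), 2),
          "accuracy:", round(accuracy_score(y_te, pred), 2))

# 1) train on the data as-is
fit_and_report(X_tr, y_tr, "original mix")

# 2) upsample the 'Y' class so it dominates the training set
idx_y = np.where(y_tr == 1)[0]
idx_n = np.where(y_tr == 0)[0]
idx_y_up = resample(idx_y, n_samples=3 * len(idx_n), random_state=0)
idx = np.concatenate([idx_y_up, idx_n])
fit_and_report(X_tr[idx], y_tr[idx], "more 'Y'    ")
# Typically the recall for 'Y' goes up but the overall accuracy goes down,
# because the model now predicts 'Y' more often and mislabels more 'N' customers.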