Understand GBT Model Output
Hello,
Any help in this matter would be really appreciated.
I am using the GBT operator to train my model on a customer churn example set. I received approximately 80% accuracy with the GBT model. My issue now is how to relate this GBT model output to business processes.
How should I communicate the GBT results to business folks so they understand why a specific customer churns and which variables contributed to a Terminated status instead of an Active status?
Another question I have in mind: how do I calculate the threshold limits of the variables that make customers change their minds? That way we can be watchful on certain metrics to prevent churn.
Here are the results from the GBT model:
Model Metrics Type: Binomial
Description: N/A
model id: rm-h2o-model-gradient_boosted_trees-422159
frame id: rm-h2o-frame-gradient_boosted_trees-324798
MSE: 0.10739042
R^2: 0.5584855
AUC: 0.9389837
logloss: 0.35373378
CM: Confusion Matrix (vertical: actual; across: predicted):
Active Terminated Error Rate
Active 590 139 0.1907 = 139 / 729
Terminated 53 470 0.1013 = 53 / 523
Totals 643 609 0.1534 = 192 / 1,252
Gains/Lift Table (Avg response rate: 41.77 %):
Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
1 0.01038339 0.926587 2.393881 2.393881 1.000000 1.000000 0.024857 0.024857 139.388145 139.388145
2 0.02076677 0.926248 2.393881 2.393881 1.000000 1.000000 0.024857 0.049713 139.388145 139.388145
3 0.03035144 0.926021 2.393881 2.393881 1.000000 1.000000 0.022945 0.072658 139.388145 139.388145
4 0.04073482 0.925124 2.393881 2.393881 1.000000 1.000000 0.024857 0.097514 139.388145 139.388145
5 0.05111821 0.924748 2.393881 2.393881 1.000000 1.000000 0.024857 0.122371 139.388145 139.388145
6 0.10063898 0.913532 2.393881 2.393881 1.000000 1.000000 0.118547 0.240918 139.388145 139.388145
7 0.15015974 0.872454 2.393881 2.393881 1.000000 1.000000 0.118547 0.359465 139.388145 139.388145
8 0.20047923 0.754298 2.355883 2.384344 0.984127 0.996016 0.118547 0.478011 135.588333 138.434408
9 0.30031949 0.570023 1.953407 2.241081 0.816000 0.936170 0.195029 0.673040 95.340727 124.108051
10 0.40015974 0.429297 1.378876 2.025960 0.576000 0.846307 0.137667 0.810707 37.887572 102.595955
11 0.50000000 0.326709 0.957553 1.812620 0.400000 0.757188 0.095602 0.906310 -4.244742 81.261950
12 0.59984026 0.267012 0.459625 1.587421 0.192000 0.663116 0.045889 0.952199 -54.037476 58.742072
13 0.69968051 0.227460 0.344719 1.410095 0.144000 0.589041 0.034417 0.986616 -65.528107 41.009455
14 0.80031949 0.103437 0.132993 1.249501 0.055556 0.521956 0.013384 1.000000 -86.700659 24.950100
15 0.90095847 0.068919 0.000000 1.109929 0.000000 0.463652 0.000000 1.000000 -100.000000 10.992908
16 1.00000000 0.057902 0.000000 1.000000 0.000000 0.417732 0.000000 1.000000 -100.000000 0.000000
Variable   Relative Importance   Scaled Importance   Percentage
Field1     445.525879            1                   0.49061
Field2     158.352005            0.355427            0.174376
Field3     93.245522             0.209293            0.102681
Field4     51.406567             0.115384            0.056609
Field5     34.961025             0.078471            0.038499
Field6     26.576853             0.059653            0.029266
Field7     19.5725               0.043931            0.021553
Field8     19.506002             0.043782            0.02148
Field9     19.407133             0.04356             0.021371
Field10    13.182694             0.029589            0.014517
Field11    11.111937             0.024941            0.012236
Field12    4.461669              0.010014            0.004913
Field13    3.955152              0.008877            0.004355
Field14    3.564302              0.008               0.003925
Field15    3.276087              0.007353            0.003608
Field16    0                     0                   0
Thank You
Answers
Short answer: you can't gain any intuition from a GBT directly. A GBT is an ensemble of trees (sometimes hundreds of them), so it is really difficult to interpret.
I've seen in other software (I can't remember which one) an approach where you hold k-1 variables constant, vary the remaining variable, and plot the GBT's forecast. That way you can visualize what type of relationship exists between the label and that attribute.
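To make that idea concrete (it is essentially a partial dependence plot), here is a minimal sketch in Python. It assumes a fitted model exposing a predict_proba method and a pandas DataFrame X; the column name "tenure_months" is invented for illustration:

```python
import numpy as np

def partial_dependence_curve(model, X, column, n_grid=30):
    """Sweep one column over a grid, hold all other columns at their
    observed values, and average the predicted churn probability."""
    grid = np.linspace(X[column].min(), X[column].max(), n_grid)
    avg_prob = []
    for value in grid:
        X_mod = X.copy()
        X_mod[column] = value  # force every row to this value
        avg_prob.append(model.predict_proba(X_mod)[:, 1].mean())
    return grid, np.array(avg_prob)

# grid, probs = partial_dependence_curve(gbt_model, X_train, "tenure_months")
# Plot grid vs. probs to see the label/attribute relationship.
```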
With respect to your second question: Before you find an optimal threshold you have to specify the costs of making mistakes in your classification. Once you know those costs you can use operators like "Find threshold" to solve for the optimal T.
I just came back from giving RapidMiner training, and a similar question was raised there: how do you explain a complex algorithm like a neural net or GBT in layman's terms to a business group? It's hard, especially if the algorithm can handle high-dimensional data or is just complex in its workings.
In your case, explaining it might be a bit easier than, say, a neural net. Everyone understands a decision tree, so you can say that GBT is like a decision tree but better, because it generates many more trees (like a Random Forest) and has some special characteristics that help convert your 'weak' hypotheses into 'stronger' ones. There's a great high-level overview of GBT here: http://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
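If it helps to show rather than tell, here is a toy sketch of the boosting idea on a regression problem with squared-error loss (real GBT implementations such as H2O's are considerably more elaborate): each shallow tree is fit to the residuals of the ensemble so far, so many 'weak' trees add up to a strong model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # toy target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # F0: start from the mean
for _ in range(100):
    residual = y - prediction                                    # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # weak learner
    prediction += learning_rate * stump.predict(X)               # boosting step

print("final training MSE:", np.mean((y - prediction) ** 2))
```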
The model output does also provide some insight into variable importance; it's the section below. It won't tell you why a specific case has the prediction that it does, but at least it gives you an overall sense of which attributes are most important, and how strong they are relative to each other, in the predictions from that GBT model (a short sketch after the table shows how the three columns relate):
Variable   Relative Importance   Scaled Importance   Percentage
Field1     445.525879            1                   0.49061
Field2     158.352005            0.355427            0.174376
Field3     93.245522             0.209293            0.102681
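For reference, the three columns are related in a simple way, which you can verify against the numbers above: scaled importance is the relative importance divided by the largest value, and percentage is the relative importance divided by the total over all attributes (approximately 908.11 for the full 16-field table in the original post).

```python
import numpy as np

relative = np.array([445.525879, 158.352005, 93.245522])  # from the table above
print(relative / relative.max())  # scaled: [1.0, 0.355427, 0.209293] (approx.)
print(relative / 908.105327)      # percentage: [0.49061, 0.174376, 0.102681] (approx.)
```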
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey,
this sounds pretty much like a use case for my Get Local Interpretation operator, which is available in the Operator Toolbox extension. Have a look at it.
If this fits your needs, I am happy to take a look at this personally.
Best,
Martin
Dortmund, Germany
Hello,
Thank you for your response. It makes sense, but how do you define "the costs of making mistakes in classification"? It would be very helpful if you could share a little more insight on this topic. I will explore this context and see if it helps me identify the threshold value of any specific variable with respect to the label variable.
Thanks!
Shraddha
I'll give you an example: the classical case of mailing an offer (a "catalog") to a customer.
You send the catalog to 1000 people at random, and now you want to develop a model to decide whom you should send it to in the general population. If you send the catalog and the customer buys from it, you gain $10 (this is net of all costs, including the catalog). If she does not buy anything, you lose the cost of the catalog (say $1).
How would you decide what cut-off probability to use to decide to whom you should send a catalog?
DECISION 1: Mail the catalog
With probability p you make $10 and with probability (1-p) you lose $1. Expected Value = 10*p - 1*(1-p) = 11*p - 1
DECISION 2: Don't mail it.
Then with certainty you will make $0. Expected Value = 0
You should mail when expected value of Decision 1 is greater than EV of Decision 2.
When : 11*p - 1 > 0
Or when : p > 1/11
That's your optimal cut-off point. It does not maximize "accuracy", but you don't care about "accuracy"; you care about profit.
You can construct different examples of the same type and find in each case that the optimal p is different from the default p = 0.5. Of course, if the costs are symmetric, then p* = 0.5.
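In code, the break-even calculation above is just a couple of lines (using the catalog numbers, $10 gain and $1 loss, as the assumed costs):

```python
gain, loss = 10.0, 1.0

# Mail when p*gain - (1-p)*loss > 0, i.e. when p > loss / (gain + loss)
p_star = loss / (gain + loss)
print(p_star)  # 0.0909... = 1/11

# Applied to some made-up predicted purchase probabilities:
probabilities = [0.02, 0.08, 0.15, 0.60]
print([p > p_star for p in probabilities])  # [False, False, True, True]
```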
The other example I was going to give you is the problem of classifying a transaction as fraudulent or not. Say I have a dataset with 300,000 transactions in a day, and only 500 are fraudulent. Think about the asymmetric costs of this example.
And there is a Performance (Costs) operator that allows you to enter this type of asymmetric cost in RapidMiner and optimize your model directly on those costs. Check it out!
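Not the RapidMiner operator itself, but the underlying idea fits in a few lines: score a confusion matrix against a cost matrix instead of against accuracy. The confusion matrix below is the one from the original post; the dollar costs are made up for illustration:

```python
import numpy as np

#               pred Active  pred Terminated
cm = np.array([[590, 139],    # actual Active
               [53,  470]])   # actual Terminated

costs = np.array([[0.0,  1.0],   # false churn alarm: $1 (hypothetical)
                  [20.0, 0.0]])  # missed churner: $20 (hypothetical)

print("total cost:", (cm * costs).sum())  # 139*1 + 53*20 = 1199
```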
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you for the explanation.
So do I need to identify the cost of every data point that is used to define the model, or are you saying we need to define the cost only on the label variable?
My impression from the explanation, and after trying the operator, is that you are saying to define the cost on the label variable. If that is true, I am more interested in identifying the threshold value for each important variable in the model, so that I can explain to the business users that once this variable reaches a particular band, it affects the client's decision. Let me know if this makes sense.
Thanks again for all your help and for taking the time to look into my question.
Regards,
Shraddha
Thank you Martin, I think this operator may provide a little more insight into understanding the model results with respect to the business problem.
I will explore this operator more. Thanks for pointing it out to me.
Regards,
Shraddha
Hello Martin,
I tried the "Get Local Interpretation" operator on a GLM model. I need a little help understanding how to interpret the results from this operator. After running the model, I receive the important attributes and their importance for each client. I see positive and negative coefficient values; do the positive and negative signs have any meaning with respect to the label attribute?
Your response is appreciated. Attached is the sample output from the Get Local Interpretation operator.
Hello,
the sign does mean something. The output always depends on the weighting scheme used. I think you are using a GLM internally, so a negative sign here means that the attribute is a strong indicator for your negative class.
Best,
Martin
Dortmund, Germany
Thanks, I really appreciate your quick response. Your explanation helps a lot.
I was under the impression that if the label is the negative class, all the important attributes and their importance values should be negative. But I am not seeing that in my example. In fact, I see only one or two attributes with a negative coefficient; the rest have positive coefficients. So I am wondering how I should interpret the results.
Thank you so much for taking the time to help me understand.
Regards,
Shraddha
Dear Shraddha,
I think we need to be a bit more careful in interpreting those attributes.
The weights of a GLM are the coefficients of the derived regression formula, i.e.:
y = a*att_1 + b*att_2 + c*att_3 + ...
In our case, y is 1 for the positive and 0 for the negative class.
If a is positive, it means that the y value gets bigger as att_1 grows, so the outcome will more likely be positive. But you could also interpret it the other way around: the absence of att_1 is an indicator for a negative outcome.
Let's take a concrete example: we want to derive salary from age and get:
Salary = 1000*age
Age thus has a high weight and a positive sign: high age goes with a high salary. On the other hand, you can also interpret it as: LOW age is strongly associated with a SMALL salary.
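To make the sign interpretation concrete, here is a small sketch with scikit-learn's logistic regression (a GLM for a 0/1 label) on made-up churn data; the feature names are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
support_calls = rng.poisson(2, n)      # made-up churn driver
tenure_months = rng.uniform(1, 60, n)  # made-up retention driver
# Label: 1 = Terminated. More calls push toward churn, longer tenure away.
logit = 0.8 * support_calls - 0.1 * tenure_months
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([support_calls, tenure_months])
model = LogisticRegression().fit(X, y)
# Positive coefficient -> pushes toward the positive class (Terminated);
# negative coefficient -> pushes toward the negative class (Active).
print(dict(zip(["support_calls", "tenure_months"], model.coef_[0])))
```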
A second point to raise is that Get Local Interpretation has a locality parameter which might influence the result.
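For intuition about what "local" means here, a hedged sketch of the general local-surrogate idea (in the spirit of LIME, not necessarily the operator's exact implementation): perturb the data around one client, weight the perturbed rows by how close they are to that client, and fit a small linear model whose coefficients act as the local importances. The locality parameter plays the role of the neighborhood width.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_interpretation(model, X, instance, locality=1.0, n_samples=500):
    """Weighted linear surrogate around one row; the returned coefficients
    are local importances (sign = direction of influence)."""
    rng = np.random.default_rng(0)
    scale = X.std(axis=0) * locality
    samples = instance + rng.normal(0.0, scale, size=(n_samples, X.shape[1]))
    probs = model.predict_proba(samples)[:, 1]   # black-box output
    dist = np.linalg.norm((samples - instance) / scale, axis=1)
    weights = np.exp(-dist ** 2)                 # nearer rows count more
    return Ridge(alpha=1.0).fit(samples, probs, sample_weight=weights).coef_

# coefs = local_interpretation(fitted_model, X_train.values, X_train.values[0])
```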
Best,
Martin
Dortmund, Germany