Determining which attributes contribute to the value of a label
Hi Community, I'm currently running my first data mining project and I'm having some serious doubts. I hope I'm posting my question in the right place and that some of you could help me with some kind of hint or advice. It may be obvious to you, but to me the differences between the many tools and techniques are still quite blurry.
I have my data in an Oracle database, stored into 4 tables: CUSTOMERS, ACCOUNTS, TRANSACTIONS and ALERTS.
The common attribute for each of them is CUSTOMER_ID.
The attribute which is most "interesting" to me is called TRUE_POSITIVE, it's a column from table ALERTS, and it takes either value "Yes" or "No".
The GOAL of my project is to determine which of the attributes contribute the most to the value of TRUE_POSITIVE being = "Yes".
My dataset is moderate in size (maybe 50 attributes in total, tables having between 50k to 700k examples).
At this point I've imported my data into RapidMiner Studio and done some initial data cleansing (rejected certain columns, filtered out examples with missing important attributes, etc.).
Many attributes take binominal values (for example: CUSTOMERS.FACE_TO_FACE_IDENTIFIED), many are polynominal (for example: CUSTOMERS.NATIONALITY).
I've also created some new attributes in table CUSTOMERS, like NO_OF_ALERTS_POS, which stores the number of true positive alerts for the particular customer, or HR_CASHFLOW which stores customers' average monthly value of transactions made "with" high risk countries.
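For illustration, aggregates like these can be derived along the following lines in pandas; everything except CUSTOMER_ID and TRUE_POSITIVE (the AMOUNT, MONTH, and HIGH_RISK_COUNTRY columns, and the toy rows) is a hypothetical stand-in for the real schema:

```python
import pandas as pd

# Toy stand-ins for the ALERTS and TRANSACTIONS tables.
alerts = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 2, 3],
    "TRUE_POSITIVE": ["Yes", "No", "Yes", "No"],
})
transactions = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 2, 3],
    "AMOUNT": [100.0, 250.0, 80.0, 40.0],
    "MONTH": ["2017-01", "2017-02", "2017-01", "2017-01"],
    "HIGH_RISK_COUNTRY": [True, False, True, False],
})

# NO_OF_ALERTS_POS: number of true positive alerts per customer.
no_of_alerts_pos = (
    alerts[alerts["TRUE_POSITIVE"] == "Yes"]
    .groupby("CUSTOMER_ID").size()
    .rename("NO_OF_ALERTS_POS")
)

# HR_CASHFLOW: average monthly value of transactions involving high risk countries.
hr_cashflow = (
    transactions[transactions["HIGH_RISK_COUNTRY"]]
    .groupby(["CUSTOMER_ID", "MONTH"])["AMOUNT"].sum()  # monthly totals
    .groupby("CUSTOMER_ID").mean()                      # average over months
    .rename("HR_CASHFLOW")
)

customers = pd.concat([no_of_alerts_pos, hr_cashflow], axis=1).fillna(0)
print(customers)
```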
My main question is:
Which tool / operator should I use to achieve my goal? A correlation matrix? Regression?
And some additional questions:
What would be the optimal number of attributes? Does my current dataset require much dimensionality reduction?
Can I use my new attributes to avoid joining tables? Does it make any sense, and is there a big risk that I will miss the chance of detecting some unobvious correlations?
Many thanks in advance for your help.
Answers
Hi! Welcome to the boards. I moved your post to the RapidMiner Studio forum because you're using Studio.
OK, your task is really a standard classification analysis. You're trying to use the data you cleaned to learn the patterns that make one record a "Yes" or a "No."
What I would suggest is to use a predefined Cross Validation building block (right click in the design canvas, select Insert Building Block, insert Nominal Cross Validation). The default algorithm is a Decision Tree (double click to see inside) and if you run it, it will output a confusion matrix and tell you how well that algorithm was able to discern between Yes and No. You can swap out the Decision Tree with maybe a Logistic Regression or some other algorithm and test again to see which gives you a better model.
Why do we try different algos? It's because some algos perform better on different data sets, so it becomes an iterative process sometimes. With RapidMiner, it's simple to swap out different algos, and once you become more advanced you can build a process to auto-model and auto-select the best algorithm. We've been doing auto-modeling, tuning, and selection for years but never talked about it much.
Most likely you will need dimensionality reduction if your data set is really wide. There are different techniques to do it, but first try the Cross Validation one and we'll go from there.
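If it helps to see the idea outside of Studio, here is a minimal sketch of the same compare-and-cross-validate loop in Python/scikit-learn, with synthetic data standing in for the real example set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned example set; heavily imbalanced on
# purpose, like the alerts data.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.94],
                           random_state=42)

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                    ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    # 10-fold cross validation, mirroring the Nominal Cross Validation block.
    pred = cross_val_predict(model, X, y, cv=10)
    print(name)
    print("  accuracy: %.4f" % accuracy_score(y, pred))
    print("  confusion matrix:\n%s" % confusion_matrix(y, pred))
```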
Thomas, thanks for your response!
I've tried a few algorithms; results below. Could you help me interpret them?
Decision Tree:
accuracy: 93.77% +/- 0.01% (micro: 93.77%)
ConfusionMatrix:
            true Nie   true Tak
pred. Nie:  42254      2806
pred. Tak:  0          0
precision: unknown (positive class: Tak)
recall: 0.00% +/- 0.00% (micro: 0.00%) (positive class: Tak)
AUC (optimistic): 1.000 +/- 0.000 (micro: 1.000) (positive class: Tak)
AUC: 0.500 +/- 0.000 (micro: 0.500) (positive class: Tak)
AUC (pessimistic): 0.000 +/- 0.000 (micro: 0.000) (positive class: Tak)
k-NN:
accuracy: 88.27% +/- 2.31% (micro: 88.27%)
ConfusionMatrix:
            true Nie   true Tak
pred. Nie:  39582      2612
pred. Tak:  2672       194
precision: 6.58% +/- 1.82% (micro: 6.77%) (positive class: Tak)
recall: 6.68% +/- 1.39% (micro: 6.91%) (positive class: Tak)
AUC (optimistic): 0.941 +/- 0.011 (micro: 0.941) (positive class: Tak)
AUC: 0.500 +/- 0.000 (micro: 0.500) (positive class: Tak)
AUC (pessimistic): 0.062 +/- 0.012 (micro: 0.062) (positive class: Tak)
GLM:
accuracy: 91.27% +/- 4.57% (micro: 91.27%)
ConfusionMatrix:
            true Nie   true Tak
pred. Nie:  40693      2371
pred. Tak:  1561       435
precision: 32.86% +/- 13.99% (micro: 21.79%) (positive class: Tak)
recall: 14.18% +/- 6.19% (micro: 15.50%) (positive class: Tak)
AUC (optimistic): 0.575 +/- 0.018 (micro: 0.575) (positive class: Tak)
AUC: 0.557 +/- 0.017 (micro: 0.557) (positive class: Tak)
AUC (pessimistic): 0.539 +/- 0.018 (micro: 0.539) (positive class: Tak)
Looking at the confusion matrices, the last one seems to be the best, but it's still far from actually good: only about 1 in 3 predictions of the label value "Tak" was correct.
You could also use the various "Weight by" operators. These will create a set of weights for the attributes where the value of the weight is nearer 1 if the attribute is relevant for the label and nearer 0 if it is not. You can then use the "Select by Weights" operator to select attributes of interest based on the weights to yield an example set with only the attributes of interest.
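For illustration, the same weight-then-select idea looks roughly like this outside of Studio (a Python/scikit-learn sketch, not the actual operators):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=2000, n_features=15, n_informative=4,
                           random_state=0)

# Score each attribute against the label (analogous to a "Weight by" operator).
weights = mutual_info_classif(X, y, random_state=0)
weights = weights / weights.max()  # normalize so the most relevant attribute is 1.0

# Keep only attributes whose weight exceeds a threshold ("Select by Weights").
keep = weights > 0.3
X_selected = X[:, keep]
print("kept attribute indices:", np.where(keep)[0])
```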
Andrew
So now we start doing some basic data science. Each algo you chose has its shortcomings, and the results all suck IMHO.
The Decision Tree model is pure garbage; it thinks everything is "Nie".
The k-NN is a bit better, but it really has a hard time finding "Tak." While on the surface this appears bad, there might be some opportunity to tune the k value and make sure the attributes are properly normalized.
The GLM is slightly better in a different way, but it too has a hard time discerning "Tak."
So what are some of the ways you can make this better? You might want to first go back to your dataset and try to balance the data. It appears that there are far fewer instances of "Tak" than of "Nie." This is what we call an unbalanced set, and in the case of classification tasks it can cause the algo to just lump everything into the "Nie" category, like the Decision Tree did.
I would add a Sample operator inside the Cross Validation operator (right before the algo on the training side), toggle on "balance data," select an equal amount of each class, and then train the algo again. Also check whether the attributes you're using for training can be normalized if you're using k-NN; scaling can have a big impact with that algo.
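A minimal sketch of both ideas, balancing only the training folds and normalizing before k-NN, in Python/scikit-learn rather than the actual operators:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.94],
                           random_state=1)

rng = np.random.default_rng(1)
recalls = []
for train, test in StratifiedKFold(n_splits=10).split(X, y):
    # Balance only the training fold: downsample the majority class so both
    # classes have equal counts (the Sample operator with "balance data").
    pos = train[y[train] == 1]
    neg = rng.choice(train[y[train] == 0], size=len(pos), replace=False)
    balanced = np.concatenate([pos, neg])

    # Normalize before k-NN; distance-based algos are sensitive to scale.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    model.fit(X[balanced], y[balanced])
    recalls.append(recall_score(y[test], model.predict(X[test])))

print("minority-class recall: %.3f +/- %.3f" % (np.mean(recalls), np.std(recalls)))
```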
You may also be interested in looking at the Performance (Costs) operator, which allows you to specify different costs of classification and misclassification. It may be that not all errors are equal, and Performance (Costs) gives you a way to indicate the relative importance of each type of misclassification. The modeling algorithm will then seek to minimize the costs (this doesn't work for all algorithms, but it does for the main ones).
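To make the cost idea concrete, here is a tiny Python sketch of the calculation over the GLM confusion matrix above; the cost values themselves are hypothetical:

```python
import numpy as np

# GLM confusion matrix from above: rows = predicted, columns = true.
conf = np.array([[40693, 2371],   # predicted Nie
                 [1561,   435]])  # predicted Tak

# Hypothetical cost matrix: a missed true positive ("Tak" predicted as "Nie")
# costs 10, a false alarm costs 1, correct predictions cost nothing.
costs = np.array([[0, 10],
                  [1,  0]])

# Total cost the model would be tuned to minimize: 2371*10 + 1561*1 = 25271.
print("total misclassification cost:", (conf * costs).sum())
```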
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thanks to all for the answers. I'm not implementing all your suggestions yet, because I want to take one step at a time and make sure I know what I'm doing.
My current model performance is as follows:
And the process itself looks like this:
There is improvement, but the results are still not satisfactory.
I'm pretty sure it's necessary to prepare the data more thoroughly, but I don't know how exactly.
You're right, the overall model still isn't that great, BUT the classifier is starting to pick up "Tak" a lot better. I think this model can be optimized for sure.
I would try a GBT and an SVM algo. For the SVM, try a radial kernel with C = 1000 and gamma = 0.01 initially. If any of these algos show improvement, then I would suggest using the Grid optimization operator to automatically vary and test combinations of parameters.
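As a rough illustration of the same idea in Python/scikit-learn (not the RapidMiner operators), with the suggested starting point and a small grid around it:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=16, weights=[0.9],
                           random_state=7)

# RBF-kernel SVM with the suggested starting parameters.
pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(kernel="rbf", C=1000, gamma=0.01))])

# Grid optimization: vary C and gamma around that starting point.
grid = GridSearchCV(pipe,
                    {"svm__C": [10, 100, 1000],
                     "svm__gamma": [0.001, 0.01, 0.1]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best AUC: %.3f" % grid.best_score_)
```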
Another thought, if you have a wide data set (i.e. many attributes), is to do some automatic feature selection to reduce the attributes but keep the ones with the best predictive power. I'm attaching the XML for that operator here. You'd have to put that between the Cross Validation and Replace Missing Values operators. Before you use it, I would take another look at the process as a whole. The Replace Missing Values operator gives me pause; I would think about that one to make sure it's doing what you want.
We use this method for our own PQL scoring system and it works really well. We distill hundreds of attributes down to the best 15, but it does take processing time. To make it go faster, adjust the k-folds in the Cross Validation operator and vary the initial generations / population size, etc.
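For illustration, a comparable (though not evolutionary) automatic feature selection outside of Studio could look like this Python/scikit-learn sketch, which repeatedly drops the weakest attributes and keeps the subset with the best cross-validated AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

# Synthetic wide data set: 50 attributes, few of them informative.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=8,
                           random_state=3)

# Recursive feature elimination with cross validation; drops 5 attributes
# per round, ranked by the GBT's feature importances.
selector = RFECV(GradientBoostingClassifier(random_state=3),
                 step=5, cv=5, scoring="roc_auc")
selector.fit(X, y)
print("attributes kept:", selector.n_features_)
```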
Good luck.
sta
I limited my dataset to 16 attributes (incl. the label), which I think is good enough.
Then I've repeated the modelling and here are performances of different algos:
I used the Grid optimization operator, but I was a bit disappointed with the results - in most cases manipulating the parameters was basically just shifting the "Nie"/"Tak" class recall ratio. When the recall for "Nie" went up, so did the overall accuracy, but that's just because there are a lot more examples of "Nie" in my example set (63k examples vs. 1k examples).
I went back to my data set, because I thought it was the reason why I couldn't squeeze more effectiveness out of these models.
I created my data purely for this project. I incorporated some patterns and regularities into it, but there is still much randomness.
So to make the job easier (possible?) for the modelling algorithms, I turned some alerts labelled "Nie" into "Tak" for some specific customers.
Now there were 60.5k examples of "Nie" and 3.5k examples of "Tak" in my post-cleansing example set (though the model learning subprocess still used 50/50 sampling).
This was enough to boost the performance of GBT to the following results:
...which is good enough for me.
After all, I'm studying a hypothetical piece of AML software. If I managed to make my model predict with 100% accuracy which customers are likely to commit money laundering, that wouldn't be very realistic.
Now coming back to my goal:
I'm trying to figure out which attributes indicate that the customer will have a true positive alert.
The model description contains this section:
I guess I can conclude that the top 3 variables are the attributes which should be taken into account when estimating a customer's risk.
If I go to the description of the trees, I will also be able to determine what values of these attributes are most likely to give a TRUE_POSITIVE = "Tak".
Is my understanding correct?
Hi,
Yes, you are right. To be a bit more precise, the table tells you the overall, global importance for the tree. It reads like this: 72% of the information needed for the classification is contained in the COUNTRY_OF_RESIDENCE attribute. What I would have a look at is the cumulative sum of the last column. Seeing this, I would argue for taking more than 5 attributes into account.
What happens if you learn the GBT only on the top 4/5/6/7 attributes? I would be interested to see AUC vs. number of attributes for the GBT. That chart might be helpful.
A side note: this is a global number. It can still be the case that for a single customer other attributes have a huge effect, but that would only happen for a small fraction of your customer base.
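A sketch of how that AUC-vs-number-of-attributes comparison could be run in Python/scikit-learn, with synthetic data standing in for the real example set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=15, n_informative=6,
                           random_state=5)

# Rank attributes by GBT importance and inspect the cumulative sum.
gbt = GradientBoostingClassifier(random_state=5).fit(X, y)
order = np.argsort(gbt.feature_importances_)[::-1]
print("cumulative importance:",
      np.cumsum(gbt.feature_importances_[order]).round(3))

# Re-learn the GBT on the top-k subsets and compare cross-validated AUC.
for k in (4, 5, 6, 7):
    auc = cross_val_score(GradientBoostingClassifier(random_state=5),
                          X[:, order[:k]], y, cv=5, scoring="roc_auc")
    print("top %d attributes: AUC %.3f +/- %.3f" % (k, auc.mean(), auc.std()))
```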
Best,
Martin
Dortmund, Germany
For 15 attributes:
Accuracy: 88.07% +/- 1.80% (micro: 88.07%)
AUC: 0.941 +/- 0.005 (micro: 0.941) (positive class: Tak)
For 6 top attributes:
Accuracy: 90.56% +/- 1.53% (micro: 90.56%)
AUC: 0.934 +/- 0.007 (micro: 0.934) (positive class: Tak)
However, the increase in accuracy came at the cost of reduced "Tak" class recall, so I went back to the wider attribute set.
OK, so now the model is built and I know the attribute importance, but one question remains:
How can I get to know which values make the model predict a "Tak" or a "Nie"?
In my example the top 2 attributes are COUNTRY_OF_RESIDENCE and PROFESSION. What are the actual countries/professions that give me a "Tak"?
If this is from the Random Forest learner, you would have to inspect the individual trees to determine that relationship.
Alternatively, you can run a Naive Bayes model on your reduced dataset with the top 16 attributes (or whatever you want to see). While the overall model might not be that accurate, the model output provides a set of views that show the relationship between your attribute values (both numerical and nominal) and your label.
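A model-free way to eyeball the same relationship is a per-value rate table; here is a pandas sketch with made-up rows standing in for the real data:

```python
import pandas as pd

# Made-up example rows; in practice this would be the reduced example set.
df = pd.DataFrame({
    "COUNTRY_OF_RESIDENCE": ["PL", "PL", "IR", "IR", "DE", "IR"],
    "TRUE_POSITIVE":        ["Nie", "Nie", "Tak", "Tak", "Nie", "Nie"],
})

# Share of "Tak" alerts per country; values with high rates are the ones
# pushing predictions toward "Tak". The same works for PROFESSION.
rates = (df.assign(tak=df["TRUE_POSITIVE"].eq("Tak"))
           .groupby("COUNTRY_OF_RESIDENCE")["tak"]
           .agg(["mean", "size"])
           .sort_values("mean", ascending=False))
print(rates)
```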
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Or use the Weight by Tree Importance Operator
~Martin
Dortmund, Germany
Hi, good day. I am starting to learn how to use RapidMiner and I want to ask: what operator did you use to get this output?
It comes from the model output of the Gradient Boosted Trees operator.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
And by the way: 8.0 includes an update to Random Forest and Decision Tree. Both of them now deliver their attribute importances on a weight port.
Best,
Martin
Dortmund, Germany