Data mining case: your advice wanted
Hi guys,
Below is a description of a study done by the author on a real dataset. I'd like feedback on what was done well and what should be improved. I hope the problem is interesting to think about, and thank you in advance for any ideas.
Data overview. A credit data set with 10k examples and 500 attributes. Credit was issued to a wide range of extremely high-risk customers, and the overall default rate is about 50%. Applications were submitted online, so many data points were collected from various sources, resulting in 500 attributes with information gain weight > 0.001 and pairwise correlation < 0.95. The attributes come from naturally different data sources; however, we can't say they are independent. Correlation between attributes from different sources can be described as significant by nature, because good customers have a full set of “positive predictors” and bad customers a full set of “negative predictors”. The task is to build a classification model that better separates good and defaulted customers.
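For illustration, a minimal pandas/scikit-learn sketch of that pre-filter (the actual study used RapidMiner/Weka; `X` is an assumed DataFrame of numeric candidate attributes, `y` the default label, and mutual information stands in here for the information gain weight):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def prefilter(X: pd.DataFrame, y, gain_min=0.001, corr_max=0.95):
    # Keep attributes with at least a minimal relation to the label.
    mi = pd.Series(mutual_info_classif(X, y), index=X.columns)
    X = X.loc[:, mi > gain_min]
    # Walk attributes in descending weight order and drop any attribute
    # that correlates above the threshold with an already kept one.
    corr = X.corr().abs()
    keep = []
    for col in mi.loc[X.columns].sort_values(ascending=False).index:
        if all(corr.loc[col, k] < corr_max for k in keep):
            keep.append(col)
    return X[keep]
```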
Weightings. If we sort the attributes' information gain weights in descending order and compute a running total, the proportions are as follows: the top 50 attributes contain 50% of the total information, the top 100 about 70%, and the top 250 about 95%. (Summing weights may not be entirely sound, but I hope it gives a correct rough overview.) It therefore seems there is almost no chance to improve model accuracy by selecting attributes beyond the top 250, and we should mostly work with those top 250.
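The running total is just a normalized cumulative sum; a short sketch (`weights` is an assumed pandas Series of per-attribute information gain weights):

```python
import pandas as pd

def cumulative_share(weights: pd.Series) -> pd.Series:
    # Sort descending and express the running total as a share of the sum.
    w = weights.sort_values(ascending=False)
    return w.cumsum() / w.sum()

# share = cumulative_share(weights)
# share.iloc[49], share.iloc[99], share.iloc[249]  ->  ~0.50, ~0.70, ~0.95
```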
Some attributes characterize the label only within a segment, so attribute selection by gain ratio may work better. We tried combining the information gain and gain ratio weights, including their average (also with different proportions), maximum, and product. A model built on the top 100 attributes by the product of the weights seemed to work slightly better than one built by information gain weights alone, but not significantly so.
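A sketch of those combination schemes (assuming `ig` and `gr` are pandas Series of the two weight vectors over the same attributes, normalized so the schemes are comparable):

```python
import pandas as pd

def combine(ig: pd.Series, gr: pd.Series, how="product", alpha=0.5):
    ig, gr = ig / ig.max(), gr / gr.max()      # put both on a [0, 1] scale
    if how == "average":
        return alpha * ig + (1 - alpha) * gr   # alpha sets the proportion
    if how == "maximum":
        return pd.concat([ig, gr], axis=1).max(axis=1)
    return ig * gr                             # "product"
```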
Base learner. Random Forest (Weka) was selected as the base learner for its overall efficiency, speed, and lower data-preparation needs. SVMs and non-linear logistic regression were tried and tuned, but even where their execution time stayed within 5x that of Random Forest, they showed significantly worse results. All further approaches were therefore tested with the Weka Random Forest.
Modeling. 4-fold cross-validation, looped 5 times and averaged, with a 100-tree Weka Random Forest (default K and depth) gives good results: accuracy improves as more top-weighted attributes are selected, up to 100. Beyond the top 100 attributes, accuracy stops improving, and the question is how to extract the information remaining in the other 150 attributes.
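For reference, the equivalent evaluation loop in scikit-learn (a stand-in for the original Weka/RapidMiner process; `X`, `y` are the assumed data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 4 folds repeated 5 times = 20 fitted models, averaged at the end.
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```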
Tuning the RF parameters K or depth gives no noticeable improvement. With a default K of 7 for 100 attributes, we tried K from 7 to 20 with 100 and 200 trees. As Breiman described, increasing K should have an effect when attributes are independent, but we cannot observe one on our data. Perhaps the number of trees should grow as K is increased; is 200 enough for 250 attributes?
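That grid, sketched in scikit-learn (where K corresponds to `max_features`, the number of attributes sampled per split; the values follow the post):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_features": list(range(7, 21)),  # K from 7 to 20
                "n_estimators": [100, 200]},         # 100 and 200 trees
    cv=4, scoring="roc_auc", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```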
PCA approach. As a way to extract information from the low-weighted attributes, we tried reducing the space and improving variable variance with principal components. PCA is applied separately to each data source so as not to lose information. Even so, it reduces model accuracy compared to the original attributes, so we keep the top N original attributes, apply PCA only to the rest, and then join the original and transformed attributes. For one data source, this approach gave a noticeable gain when a model was built on that source alone. However, joining this data with all the remaining attributes (all data sources) does not improve the overall model accuracy. We varied N hoping to find a good split point between kept original attributes and generated components, but still got no noticeable improvement.
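A sketch of that hybrid representation (scikit-learn stand-in; `top_cols` and `rest_cols` are assumed lists of column names ranked by weight, and unlike the original process this compresses all remaining attributes together rather than per data source):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

hybrid = ColumnTransformer([
    ("original", "passthrough", top_cols),            # top N kept as-is
    ("compressed", PCA(n_components=20), rest_cols),  # rest -> components
])
model = Pipeline([("features", hybrid),
                  ("rf", RandomForestClassifier(n_estimators=100))])
model.fit(X, y)
```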
Boosting approach. Maybe it is inherently obvious that an RF cannot be boosted. Still, we tried all the implemented boosting operators to check, and mostly got significantly worse results. Maybe the boosting was configured improperly; is it possible to boost an RF?
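For comparison, the usual boosting setup boosts a single weak tree rather than a whole forest (RF already averages many trees); a minimal AdaBoost sketch in scikit-learn, as a stand-in for the RapidMiner boosting operators (`X_train`, `y_train` assumed):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # shallow weak learner
    n_estimators=200, learning_rate=0.5, random_state=42)
booster.fit(X_train, y_train)
```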
Optimizing attribute selection. We have not run an automated attribute-selection optimization because of the high computing time. Probably only a full brute-force search could yield an improvement, and it seems it would not be a significant one. Your advice on whether optimizing the selection may give an improvement would be useful.
Concluding all of the above: Random Forest seems highly efficient, and our dataset seems well suited to it. But we stop gaining improvement after the top 100 attributes selected by weight. This may mean that no new information is contained in the remaining attributes, but we have no proof of it.
The general question is which approaches should be tried for improvement, or, if no further improvement is possible, how to prove it.
Since the study was done by a newbie, any feedback, ideas, and experience would be very helpful!
Thanks,
and special thanks to the RapidMiner team for a great analytic tool.
Answers
It seems like you already did a lot of interesting stuff. Using a PCA would have been my first reaction too.
Just some unordered things that come to my mind:
1. Using WEKA is good, but given the computing time, I would try the RapidMiner one. With release 6.2, the Decision Tree and the RF got SIGNIFICANTLY faster.
2. Throwing everything you have at the learner is almost surely not optimal, or at best would just increase the computing time. So I think the key point is a good feature selection. What about a Forward Selection using Naive Bayes (see the sketch after this list), an Evolutionary selection with a Decision Tree, or a Weight by SVM followed by an SVM (maybe with a Hold-Out Validation)?
3. Boosting an RF is not a good idea. Try boosting the Decision Tree instead.
4. The Weka forest is the original implementation by Breiman, so it uses the Gini index. A feature selection via information gain (= Kullback-Leibler divergence) seems not appropriate.
5. Did you try Weight by Tree Importance for feature selection?
6. Tree-based algorithms can have problems with XOR-like structures; have you thought about that?
7. How do you prepare your data? Sometimes it is useful to use dummy variables even for an RF.
8. Have you tried the MRMR feature selection from the Feature Selection Extension? http://sourceforge.net/projects/rm-featselext/
9. Out of curiosity: How does a Neural Net perform?
10. What are your experiences with SVMs on this dataset?
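As referenced in point 2, a minimal sketch of a forward selection wrapped around a fast Naive Bayes learner (scikit-learn here rather than the RapidMiner Forward Selection operator; the target of 50 attributes is an arbitrary placeholder):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

# Greedily add one attribute at a time, keeping whichever addition
# improves the cross-validated AUC the most.
selector = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=50,
    direction="forward", cv=4, scoring="roc_auc", n_jobs=-1)
selector.fit(X, y)
selected = X.columns[selector.get_support()]
```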
That's just what came to my mind...
Dortmund, Germany
Thanks a lot for the useful feedback. Here is what I could check for today:
1. Using WEKA is good, but given the computing time, I would try the RapidMiner one. With release 6.2, the Decision Tree and the RF got SIGNIFICANTLY faster.
What parameters should be used to get the same model as the WEKA forest?
I suggest: criterion = Gini index, unlimited depth, confidence and gain set to minimum/zero.
What I got this way: the RapidMiner forest gives an AUC around 0.71, WEKA 0.77. The best I could get by playing with (really brute-forcing ;-)) the RM forest parameters is an AUC of 0.72. And it executes too fast (a couple of seconds) while Weka needs about 40 seconds, which seems not right.
AUC itself may not be the best measure; I calculate my own performance criterion, and it is strongly correlated with AUC.
4. The Weka forest is the original implementation by Breiman, so it uses the Gini index. A feature selection via information gain (= Kullback-Leibler divergence) seems not appropriate.
Interesting that information gain selection works slightly better for some numbers of attributes, while Gini wins for others...
5. Did you try Weight by Tree Importance for feature selection?
Tree Importance (forest built by Gini index) gives a completely different top attribute set, while Gini, information gain, gain ratio, or even linear correlation give similar or almost identical selections. Tree Importance selection gave an AUC of ~0.70 (vs. the best achieved 0.77). Probably the random forest should be well tuned first.
7. How do you prepare your data? Sometimes it is useful to use dummy variables even for an RF.
Yes, there is a problem with polynominal attributes. Keeping them as polynominal reduces (!) performance. Dummy coding gives no improvement. Removing them entirely almost does not (!) reduce performance. Keeping them as unique integers is slightly better. Why this happens, I don't know yet.
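The two encodings compared above, sketched in pandas (`cat_cols` is an assumed list of the categorical columns in the DataFrame `X`):

```python
import pandas as pd

# Dummy (one-hot) coding: one binary column per category value.
X_dummy = pd.get_dummies(X, columns=cat_cols)

# Unique-integer coding: each category value mapped to one integer code.
X_int = X.copy()
for c in cat_cols:
    X_int[c] = X_int[c].astype("category").cat.codes
```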
8. Have you tried the MRMR feature selection from the Feature Selection Extension? http://sourceforge.net/projects/rm-featselext/
Thanks, I didn't know about this extension. The forest with MRMR selection performed significantly worse (AUC 0.73-0.74).
9. Out of curiosity: How does a Neural Net perform?
Quick tests gave no more than 0.72.
If you only need 10 seconds for a forest, why not loop over some settings, log the results, and let it run for a night? Did you try the new Decision Tree for an evolutionary feature selection?
Getting such different selections seems to indicate that there is something to gain. Have you tried taking the union of the selected attribute sets? Probably yes... Did you compare the different selections by hand, or using something like the Jaccard index? (A small sketch follows below.)
So how do you handle them? As unique integers?
This extension is once again a hidden gem. I don't know why it is not on the Marketplace. I only know about it because I worked together with its author.
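For reference, the Jaccard index mentioned above, which scores the overlap of two attribute selections as intersection over union (the selections are assumed to be lists of attribute names):

```python
def jaccard(a, b):
    # 1.0 = identical selections, 0.0 = completely disjoint.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# e.g. jaccard(top100_by_info_gain, top100_by_tree_importance)
```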
For the performance: try looping over the number of selected attributes and logging the AUC; maybe you will see some trend (a sketch of such a loop is below). Regarding the neural net: what happens if you add layers? And what is the std_dev on it?
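A minimal sketch of that logging loop (a scikit-learn stand-in for RapidMiner's loop and log operators; `ranked` is an assumed list of attribute names sorted by weight, descending):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for n in range(25, 251, 25):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    auc = cross_val_score(rf, X[ranked[:n]], y, cv=4, scoring="roc_auc")
    print(f"{n:4d} attributes: AUC {auc.mean():.3f} +/- {auc.std():.3f}")
```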
One general thing: you should really do something "systematic". Just trying things out is interesting, but maybe not as useful as using an Optimize by Grid and letting it run for a night on a RapidMiner Server. Using the logs there is quite useful and gives you more insight.
Best,
Martin
Dortmund, Germany