Newbie question - cross validation using decision tree

mystic86 · July 2016

Hi,

Really sorry about such a stupid question, I am very new!

I want to do cross validation with a decision tree. Here is my dataset:

Screen Shot 2016-07-30 at 8.38.09 p.m..png

Here is the setup I have:

mystic86 · July 2016

Sorry, the rest of my first post was cut off.

Here is the setup I have:

Screen Shot 2016-07-30 at 8.39.08 p.m..png

And here is the setup inside the X-validation:

Screen Shot 2016-07-30 at 8.39.53 p.m..png

mystic86 · July 2016

When I run this, it seems to do nothing at all except pretty much summarise my data. Here are some screenshots of the results:

Screen Shot 2016-07-30 at 8.41.39 p.m..png

Can anyone help me figure out what is going on here? Is it something to do with how my attributes are set up in terms of their roles etc.?

Screen Shot 2016-07-30 at 8.44.20 p.m..png

Thanks!!

bhupendra_patil · July 2016

Hi Philip,

It looks like the model is predicting YES for every example, so it is not a useful model.
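To illustrate why a model that always answers YES can still report a decent accuracy, here is a minimal Python sketch. The label counts (90 "yes", 10 "no") and the helper names are made up for illustration; they are not taken from the dataset in this thread.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of cases where the prediction matches the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall_per_class(actual, predicted):
    """For each class, the fraction of its cases that were predicted correctly."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical labels: 90 "yes" and 10 "no" (illustrative only).
actual = ["yes"] * 90 + ["no"] * 10
# A model that predicts the majority class for every case.
predicted = ["yes"] * len(actual)

print(accuracy(actual, predicted))          # 0.9
print(recall_per_class(actual, predicted))  # {'yes': 1.0, 'no': 0.0}
```

The accuracy looks fine, but the recall for "no" is zero, which is the pattern to look for in the confusion matrix of the validation results.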

Is it possible for you to share the data?

Are you applying pruning on the decision tree? Try without it, or try changing the confidence value.

You may also try some other models, since decision tree does not seem to be getting close.

Keep in mind that some learners can only predict binary values, some polynominal values, and some numbers, and there are also limitations on the kinds of data types that can be used as input variables.

You can use RapidMiner operators to tweak the data to meet those requirements, but plan accordingly.

Thomas_Ott · July 2016

Some other questions to ask: does the data need to be balanced too? Is there any feature generation that can be done?

yyhuang · August 2016

@Thomas_Ott raises a good point: unbalanced data is an interesting and very frequent problem in classification.

As you found empirically, a training set with very different numbers of examples per class may result in a classifier that is biased towards the majority class. When such a classifier (not only a decision tree) is applied to a test set that is similarly imbalanced, it yields an optimistic accuracy estimate. In an extreme case (just like your example), the classifier might assign every single test case to the majority class, thereby achieving an accuracy equal to the proportion of test cases belonging to the majority class. Some strategies for learning from unbalanced data:

1. Undersampling,

by removing samples from the majority class using an undersampling algorithm, for instance using the Sample operator in RapidMiner with absolute sizes to balance the data with a specified sample size per class

2. Oversampling,

by generating new samples from the minority class using an oversampling algorithm, for instance Bootstrapping Sample in RapidMiner

3. Cost-sensitive learning,

by changing the decision tree building algorithm so that misclassifications of minority class samples have a higher cost than misclassifications of majority class samples. MetaCost in RapidMiner is a good choice. Please refer to the built-in tutorial process "Using the MetaCost operator for generating a better Decision Tree".

4. Ensemble learning,

by using several decision trees instead of a single decision tree. Check out the Bagging algorithm in RapidMiner for bootstrap aggregating of decision tree models. In our latest release, RapidMiner 7.2, you can also try Gradient Boosted Trees (@IngoRM) and say hello to our favourite learners.

https://rapidminer.com/gradient-boosted-trees-deep-learning-less-5-minutes-bet/

hello new alg.png

5. Combination,

by combining undersampling, oversampling, and ensemble learning strategies. Most state-of-the-art learning methods for imbalanced data use a combination of different strategies. Choose the one that is best for you. I would recommend considering at least two of the mentioned approaches in conjunction.
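As a rough illustration of strategies 1 and 2 outside of RapidMiner, random undersampling and random oversampling could be sketched in plain Python as below. This is not how the RapidMiner operators are implemented; the function names, the 90/10 toy data, and the choice to balance every class to the smallest or largest class size are assumptions made for the example.

```python
import random
from collections import defaultdict

def group_by_label(rows, label_of):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups

def undersample(rows, label_of, rng):
    """Randomly drop rows so every class is as small as the smallest class."""
    groups = group_by_label(rows, label_of)
    target = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced

def oversample(rows, label_of, rng):
    """Randomly duplicate rows so every class is as large as the largest class."""
    groups = group_by_label(rows, label_of)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)  # keep every original row
        balanced.extend(rng.choices(g, k=target - len(g)))  # with replacement
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: (id, label) pairs, 90 "yes" and 10 "no".
rows = [(i, "yes") for i in range(90)] + [(i, "no") for i in range(90, 100)]
label_of = lambda row: row[1]
rng = random.Random(42)  # fixed seed so the run is repeatable

under = undersample(rows, label_of, rng)
over = oversample(rows, label_of, rng)
print(len(under), len(over))  # 20 180
```

When this is combined with cross validation, the resampling is normally applied only to the training part of each fold, so that the performance is still estimated on data with the original class distribution.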

We would be happy to post some additional references to the literature if you would like to follow up on this.
