
Stability in DT

karen Member Posts: 6 Contributor II
Hi! Regarding the stability of decision trees: I've been generating C4.5 trees (10 attributes, 4000 instances) with, for example, 81.35% accuracy +/- 1.93 in 10-fold cross-validation (good models for me). But when I delete some of the training instances (about 10, say) and re-generate the model, I get a different tree. From what I've read, that's the ("well known") problem of instability of decision trees. Despite it being a well-known problem, I could not find out how to study it with a formal approach (sampling? studying the variance of 10-fold cross-validation?), nor how to overcome it.
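To make the question concrete, something like the sketch below is what I mean by studying it empirically. This is only a minimal sketch, assuming Python with scikit-learn (whose DecisionTreeClassifier is CART rather than C4.5) and a synthetic stand-in for my real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 4000 instances, 10 attributes.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
root_features, accuracies = [], []

for trial in range(30):
    # Perturb the training set by dropping ~10 random instances.
    keep = rng.choice(len(X), size=len(X) - 10, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])

    # Record which attribute the root split uses, and the 10-fold CV accuracy.
    root_features.append(tree.tree_.feature[0])
    accuracies.append(cross_val_score(tree, X[keep], y[keep], cv=10).mean())

# Structural instability: how often does the root split change across runs?
vals, counts = np.unique(root_features, return_counts=True)
print("root split attribute frequencies:", dict(zip(vals.tolist(), counts.tolist())))

# Predictive instability: variance of accuracy across the perturbed runs.
print(f"accuracy: {np.mean(accuracies):.4f} +/- {np.std(accuracies):.4f}")
```

The idea is to separate structural stability (does the tree itself change, e.g. the root split?) from predictive stability (does the accuracy change?), since the two can behave very differently.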

Could anyone please give me a hint about how to deal with this problem?

Best
Karen

Answers

  • marcin_blachnik Member Posts: 61 Guru
    Which decision tree did you use? The one built into RM is a bit crappy and unpredictable, so I suggest using J48 from Weka. A second thing you can do, though not easily in RM, is to use beam search and compare the most similar trees from the two runs (before and after removing some instances). The last option is to use forests instead of single trees (forests of decision trees). Finally, check your trees carefully: where do they differ, and how big are they? Usually they differ near the bottom of the tree, which is reasonable since only a small portion of the data reaches those nodes, but the trees can also differ at the root when the data is noisy or its structure is complex. For example, if you have continuous data with a uniform distribution but the class labels are assigned according to the XOR function, then removing even a single instance may produce a completely different tree (the root of the tree may switch with the branches), yet the accuracy on unseen data will be almost identical. That is correct behavior: the problem is not with decision trees but with the data, or with how we interpret the data through the obtained trees.
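    To illustrate the forest point: a bagged ensemble averages over many trees grown on resampled data, so the instability of any individual tree largely cancels out in the aggregate prediction. A minimal sketch (again assuming Python with scikit-learn and a synthetic stand-in dataset, not RM or Weka, purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 4000 instances, 10 attributes.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)

# Compare a single tree against a forest under 10-fold cross-validation;
# the forest's accuracy typically varies less across folds.
for name, model in [
    ("single tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The trade-off is interpretability: you no longer get one readable tree, but the predictions are far less sensitive to removing a handful of training instances.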