Basic question about decision trees
hi,
I have read that when building a tree, the tree is tested against hold-out data (or out-of-bag data?). At least there must be some unseen test data, from which the tree was not constructed, to measure the tree's performance and also for error pruning afterwards. In reduced-error pruning (as in the REPTree algorithm), for example, the branches are tested against new test data to see how they perform. If a branch performs poorly, it may be overfitted, so it is pruned back to a more general leaf node for better performance.
My question is: where does the algorithm get this test data / hold-out data / out-of-bag data (or whatever it is called)? How does it split the data before constructing the tree? I haven't read about that anywhere (not even in papers).
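To make the pruning step above concrete, here is a minimal toy sketch of reduced-error pruning (a hand-built tree and made-up hold-out samples, not any real library's implementation): a subtree is replaced by a majority-vote leaf whenever that leaf does no worse on the hold-out data reaching the node.

```python
def predict(node, x):
    # A node is either a leaf (a plain class label) or a dict
    # {"feature": i, "thr": t, "left": ..., "right": ...}.
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["thr"] else node["right"]
    return node

def accuracy(node, data):
    return sum(predict(node, x) == y for x, y in data) / len(data)

def rep_prune(node, data):
    """Reduced-error pruning against hold-out `data` (list of (x, y) pairs)."""
    if not isinstance(node, dict) or not data:
        return node
    left = [(x, y) for x, y in data if x[node["feature"]] <= node["thr"]]
    right = [(x, y) for x, y in data if x[node["feature"]] > node["thr"]]
    node["left"] = rep_prune(node["left"], left)
    node["right"] = rep_prune(node["right"], right)
    labels = [y for _, y in data]
    leaf = max(set(labels), key=labels.count)  # majority class at this node
    # Prune: keep the subtree only if it beats a single leaf on hold-out data.
    return node if accuracy(node, data) > accuracy(leaf, data) else leaf

# Overfitted toy tree: the right branch splits again on what is really noise.
tree = {"feature": 0, "thr": 0.5, "left": 0,
        "right": {"feature": 0, "thr": 0.7, "left": 1, "right": 0}}
holdout = [([0.2], 0), ([0.3], 0), ([0.6], 1), ([0.8], 1), ([0.9], 1)]
pruned = rep_prune(tree, holdout)
print(pruned)  # the noisy right subtree collapses to the single leaf 1
```

The key point for the question: `holdout` must come from somewhere, i.e. some portion of the available data has to be set aside before the tree is grown.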
Answers
I found a paper in which I think this is described:
http://research.ijcaonline.org/volume117/number16/pxc3903318.pdf
Apparently, a subset of features is first selected on the training data; then the training data is split into training and test sets in a cross-validation manner, and the performance of each particular tree is measured on the test fold of that cross-validation.
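The cross-validation split itself is mechanical. A minimal sketch of k-fold index generation (plain Python, my own function names, not the paper's code): each fold in turn is held out as test data while the tree is grown on the remaining folds.

```python
def kfold_indices(n_samples, k):
    # Split indices 0..n_samples-1 into k consecutive folds; each fold is
    # the held-out test set exactly once, the rest is training data.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train, test in kfold_indices(10, 5):
    print(test, "held out; tree grown on", train)
```

So the tree never "finds" its test data anywhere: the splitting is done up front, and each tree is simply grown on one side of the split and evaluated on the other.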
But I still have a question: will several different trees be constructed on one subset of features only, or will they be built on several different subsets of features? And how are the features chosen? In the case of C5 the paper says: "With the use of Genetic Search Apply Feature Selection technique".