Random Forest overfitting
Hi everybody,
I am running RF for a typical binary classification problem (with 34
cases and 530 variables = training set) using RapidMiner. The first
and major problem is that the algorithm produces a 100%
performance on this training set every time, which CAN'T be true and which makes
me strongly believe that it is over-fitting. But the algorithm's
developers, Breiman & Cutler, have specifically remarked that RF
doesn't over-fit. So I am wondering whether other people have had a
similar experience and have suggestions on how to avoid it.
I have tried all sorts of options to avoid it (pruning, increasing the
number of trees, increasing the number of variables at each node, etc.). The one thing I
have not done (and am not willing to do) is reduce the number of
variables, because I want to run it in an unbiased way, without any
'a priori' selection of 'important' variables. Moreover, as far as the
literature goes, RF should do well where variables are large in
number and cases are few (n << p).
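For reference, a minimal sketch of what I mean by performance, written in Python with scikit-learn rather than RapidMiner and on synthetic data of the same 34 x 530 shape (the data, parameters and numbers are illustrative only): it contrasts accuracy measured on the training set itself with a cross-validated estimate.

# Hypothetical sketch (Python/scikit-learn, not RapidMiner): with 34 cases and
# 530 variables, accuracy scored on the training set itself is almost
# guaranteed to look perfect; a cross-validated estimate is the honest number.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 530))      # synthetic stand-in: 34 cases, 530 variables
y = rng.integers(0, 2, size=34)     # random binary labels (no real signal)

rf = RandomForestClassifier(n_estimators=500, random_state=0)

# Apparent (resubstitution) accuracy: fit and score on the same 34 cases.
train_acc = rf.fit(X, y).score(X, y)

# Cross-validated accuracy: each fold is scored on cases the model never saw.
cv_acc = cross_val_score(rf, X, y, cv=5).mean()

print("training-set accuracy:", train_acc)    # typically 1.0
print("cross-validated accuracy:", cv_acc)    # near 0.5 for random labels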
Any bit of help/suggestion will be highly appreciated.
TIA!
san
Answers
If it does not work, use another classifier?
Maybe boosting of decision stumps.
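If you want to try that idea quickly, here is a minimal sketch (Python/scikit-learn, made-up data of the same shape, purely illustrative); AdaBoost's default base learner is already a depth-1 tree, i.e. a decision stump.

# Hypothetical sketch: "boosting of decision stumps" via AdaBoost, whose default
# base learner is a depth-1 decision tree (a stump).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 530))      # stand-in for the 34 x 530 training set
y = rng.integers(0, 2, size=34)

stump_boost = AdaBoostClassifier(n_estimators=100, random_state=0)

# Again, judge it by a cross-validated score, not by training-set accuracy.
print(cross_val_score(stump_boost, X, y, cv=5).mean())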
Why would a random forest not over-fit?
As far as I recall, a random forest selects a variable randomly and then calculates the optimal split for this variable.
By iterating this process it builds a single tree; by bagging, multiple such trees are combined into an ensemble.
Surely calculating the optimal split is a process likely to over-fit?
A learning algorithm with a random component has a high variance.
Bagging is used to reduce this variance, but bagging is not some kind of magic that can prevent over-fitting.
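To make the mechanism concrete, here is a hand-rolled bagging sketch (Python/scikit-learn, made-up data; purely illustrative, not the RapidMiner implementation): bootstrap samples, one tree per sample, majority vote.

# Hypothetical sketch: bagging by hand. Each tree sees a bootstrap sample of the
# training set; the ensemble predicts by majority vote. Averaging the trees is
# what reduces the variance of the individual high-variance learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 530))
y = rng.integers(0, 2, size=34)

n_trees = 50
votes = np.zeros((n_trees, len(y)), dtype=int)

for t in range(n_trees):
    # Draw a bootstrap sample: 34 cases sampled with replacement.
    idx = rng.integers(0, len(y), size=len(y))
    tree = DecisionTreeClassifier(random_state=t).fit(X[idx], y[idx])
    votes[t] = tree.predict(X)

# Majority vote over the trees; note this is again scored on the training
# cases, so it will look near-perfect despite the labels being random.
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble training-set accuracy:", (ensemble_pred == y).mean())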
However: the RapidMiner RF seems to do the random variable selection per tree, not per node, so ONE set of variables is used for the whole tree. Since this would be a distinct difference from Breiman's version, I'm also not sure about the implementation of the accuracy estimation. Maybe the estimate is not based on the cases left out by each tree (the out-of-bag cases), but on the whole example set run through the forest. The latter could easily lead to your 100 percent accuracy, as it would amount to testing on the training set (especially with such a high-variance classifier and such a small example set).
But I'm not sure about any of this. I have raised a question rather than answered yours,
greetings
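On the out-of-bag point above: Breiman's estimate scores each case only with the trees that did not see it during training. A minimal sketch of the difference (Python/scikit-learn's oob_score on made-up data of the same shape; illustrative only):

# Hypothetical sketch: out-of-bag (OOB) estimate vs. accuracy on the full
# training set pushed through the whole forest. Only the OOB number is honest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 530))      # 34 cases, 530 variables, random labels
y = rng.integers(0, 2, size=34)

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            bootstrap=True, random_state=0).fit(X, y)

print("accuracy on the training set:", rf.score(X, y))   # typically 1.0
print("out-of-bag estimate:", rf.oob_score_)              # much lower, near chance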
Bagging is short for bootstrap aggregating.
I'm not sure I understand your comment on the workings of the RapidMiner RF:
Accuracy estimation running through the forest? Eh? Is accuracy estimation an internal process in RF?