"Help Understanding Cross Validation and Decision Trees"
spitfire_ch
Member Posts: 38 Maven
Hi,
I have some trouble understanding how the decision tree algorithm works in combination with cross-validation. Another user on Stack Overflow apparently had the very same question, which I could not put in a better way, so I apologize for simply copy-pasting it:
http://stackoverflow.com/questions/2314850/help-understanding-cross-validation-and-decision-trees
[quote author=chubbard]I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially Cross Validation allows you to alternate between training and testing when your dataset is relatively small to maximize your error estimation. A very simple algorithm goes something like this:
1. Decide on the number of folds you want (k)
2. Subdivide your dataset into k folds
3. Use k-1 folds for a training set to build a tree.
4. Use the testing set to estimate statistics about the error in your tree.
5. Save your results for later
6. Repeat steps 3-5 k times, leaving out a different fold for your test set each time.
7. Average the errors across your iterations to predict the overall error
The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal error (although that doesn't make it optimal, just that it performed best on the fold it was given - maybe using stratification will help, but everything I've read says it only helps a little bit).
As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across k trees when each tree could vary in what it chooses to split on, etc.?
What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.
[/quote]
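For readers who want to see the quoted procedure as code, here is a minimal sketch, assuming Python with scikit-learn and a placeholder dataset (RapidMiner's X-Validation operator implements the same loop graphically). It makes the quoted poster's observation visible: each fold grows its own tree, and only the averaged error is kept.
[code]
# Minimal sketch of the quoted k-fold procedure (scikit-learn assumed;
# the iris data is just a placeholder).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=42)  # steps 1-2: choose k, split

fold_errors = []
for train_idx, test_idx in kf.split(X):
    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(X[train_idx], y[train_idx])              # step 3: train on k-1 folds
    accuracy = tree.score(X[test_idx], y[test_idx])   # step 4: test on the held-out fold
    fold_errors.append(1.0 - accuracy)                # step 5: save the result

print("estimated error:", np.mean(fold_errors))       # step 7: average across folds
# The k trees themselves are thrown away; only the error estimate is kept.
[/code]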
The posted answers don't really answer the question for me. First of all, to my understanding, one should not use the same data for training and testing. So using 100% of your data for training and then testing the model on the same data is a no-go. Secondly, when you put a decision tree learner in the left (training) part of a cross-validation operator, it should indeed create a (possibly different) model for each iteration. So the question remains: which tree is chosen in the end (the one you see when you choose to output the model)? Or is this in fact some kind of average model?
Or is the idea of X-Validation not to actually create a model, but rather to see how a certain learner with certain parameters would perform on your data? In that case, however, I don't understand how you would build the actual model. If you use 100% of your data for training, you would not have any test data left for post-pruning and prevention of overfitting ...
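If that second reading is correct (cross-validation only estimates how a learner with given parameters would perform), the usual practice is to report the cross-validated estimate and then train the deliverable model on all of the data. A minimal sketch, again assuming scikit-learn and a placeholder dataset rather than anything RapidMiner-specific:
[code]
# Sketch: cross-validation estimates the learner's performance; the model you
# actually keep is then trained on 100% of the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
learner = DecisionTreeClassifier(max_depth=4, random_state=0)

# 1) Estimate how this learner + parameter set performs on unseen data.
scores = cross_val_score(learner, X, y, cv=10)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# 2) Build the final model on all the data; the estimate above is the
#    performance figure you report for it.
final_model = learner.fit(X, y)
[/code]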
Thank you very much for shedding some light on this topic and best regards
Hanspeter
Answers
How are the validation records chosen in RapidMiner? Are they automatically selected from the training set? Is the assumption correct that validation methods such as X-Validation do not influence the post-pruning step at all? Is it possible to influence the ratio of training records to validation records?
What steps are suggested in building a "final model"?
1. First find the most suitable algorithm and parameters by using e.g. X-Validation.
2. Then use this algorithm and parameters on the entire data available to build the final model? (See the sketch below.)
Thank you very much for providing some insight into this.
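A sketch of that two-step workflow, assuming scikit-learn's GridSearchCV as a stand-in for a parameter optimization operator wrapped around an X-Validation: every candidate parameter set is scored by cross-validation, and the winning candidate is then retrained on the entire data set.
[code]
# Sketch of the "final model" workflow asked about above (scikit-learn assumed,
# placeholder data and parameter grid): choose parameters by cross-validation,
# then train the deliverable model on all the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10)  # step 1: cross-validate each candidate
search.fit(X, y)                          # refit=True (default) retrains the best
                                          # candidate on the entire data set
print("best parameters:", search.best_params_)
print("cross-validated accuracy of best candidate: %.3f" % search.best_score_)
final_model = search.best_estimator_      # the model you would actually deploy
[/code]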
Text is sometimes not the medium to make an explanation, so I'd urge you to pull up one of the many samples which contain validation, and then ask your questions with the graphical representation in front of you.

They don't; they get examples from the validator, and the validator uses the model they produce on other examples.

Yes, by the validator. It is not true that the validator partitions the training set further. Learners may further partition the training examples in order to produce a model, but this is internal and does not alter the example splitting done by the validator.

The validator selects the test examples. There are several validators in RapidMiner, each for different scenarios, and each with different selection methods and parameters...

Yes. Here's where the picture comes in: look at the inputs to the learner; it is just the training example set.

And much, much more, given the different validators and their parameters...

Validation and parameter optimisation are but two of several building blocks you will need to build your application, which you mention as being in the commercial field of mail-shotting. Would it not make sense to contact RM on a commercial basis, either for training or consultancy? Time is money, etc., etc. Just my 2c.
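The division of labour described above can be sketched as a small helper function, assuming plain Python with scikit-learn rather than RapidMiner internals: the validator owns the split, the learner only ever receives the training partition, and any internal partitioning a learner does (for pruning, say) happens inside that partition.
[code]
# Sketch of the validator/learner split described above (not RapidMiner code;
# the learner factory and iris data are placeholders).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def validate(make_learner, X, y, k=10):
    """Cross-validate any learner factory and return the mean accuracy."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        model = make_learner()
        # The learner sees only the training examples. If it needs an internal
        # hold-out (e.g. for pruning), it carves it out of this partition
        # itself; the validator's split is never altered.
        model.fit(X[train_idx], y[train_idx])
        # The validator applies the returned model to the held-out examples.
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))


X, y = load_iris(return_X_y=True)
print(validate(lambda: DecisionTreeClassifier(random_state=0), X, y))
[/code]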
Thank you very much for your elaborate answers!
The principle of using a testing dataset for validation to get a performance measure is rather clear to me. What I am not sure about is how model refinement is done. For example, post-pruning requires validation records from the training set, not the testing set; the testing set is then used to get the performance of the final model, after refinement. At least this is how it is described in "Data Mining Techniques and Applications" by Hongbo Du. So, since you can't define validation records, but only the test set, I don't understand how post-pruning works in RapidMiner. Does RapidMiner use the testing set for this? Does this mean you can't do any post-pruning without choosing a validator? Or is it rather as you suggested? Basically, are the following assumptions right?
- Post-pruning (e.g. reduced-error pruning) is not based on the testing set. Rather, the learner algorithm splits the training set internally and uses a part of it for model refinement. The user has no influence on that.
- The testing set / the validator have no influence on model building, but are solely used for measuring the performance of already refined models. The exception here is parameter optimization (and similar operators), which takes performance measures as input.
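A sketch of the first assumption, with the caveat that scikit-learn (used here as a stand-in) implements cost-complexity pruning rather than reduced-error pruning: the learner carves a pruning set out of the training partition it was given, and the outer test set is only used at the very end to measure performance.
[code]
# Sketch of assumption 1: pruning relies on an internal split of the *training*
# partition only. Cost-complexity pruning stands in for reduced-error pruning,
# and the breast-cancer data is just a placeholder.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The outer (validator's) split: the test set is reserved for performance only.
X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

# Inside the "learner": carve a pruning set out of the training partition.
X_grow, X_prune, y_grow, y_prune = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Grow a full tree, then pick the pruning strength that does best on the
# internal pruning set. The outer test set plays no role in this choice.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_grow, y_grow)
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_grow, y_grow)
    score = candidate.score(X_prune, y_prune)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("test accuracy (the validator's job):", pruned.score(X_test, y_test))
[/code]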
It would, and I have already considered it. The RM team (and its community members, including yourself) are extremely helpful, competent and friendly, so it is really tempting. The problem is: I am not doing this at work, but rather in my free time. At work, I use less elaborate techniques such as SQL scripts, R and Excel to get the information I need. In my free time, I read about data mining and try to apply what I learned in RapidMiner. In particular, I try to get better answers to the problems at work than with conventional techniques. So far, I have failed. Without results, my employer will never agree to buy an enterprise contract. We're just a small company and don't have that many problems that really call for data mining solutions. As a private person, the enterprise edition is way, way, way out of my reach, unfortunately. If I could, I would, that's for sure.
Best regards
Hanspeter
My point about the graphical advantage of the UI holds here as well. You can mouse over the flow, and hover over the input and output ports to see just what is being passed around: examples, models, performances and the rest. It is a pretty good way of checking that you understand the underlying flows and their sequences.
I'll try to follow your advice and use the GUI tools to get a deeper understanding of the underlying flows.
Best regards
Hanspeter