"Help Understanding Cross Validation and Decision Trees"
spitfire_ch
Member Posts: 38 Maven
Hi,
I have some trouble understanding how the decision tree algorithm works in combination with cross-validation. Another user on Stack Overflow apparently had the very same question, which I could not put in a better way, so I apologize for simply copy-pasting it:
http://stackoverflow.com/questions/2314850/help-understanding-cross-validation-and-decision-trees
[quote author=chubbard]I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially Cross Validation allows you to alternate between training and testing when your dataset is relatively small to maximize your error estimation. A very simple algorithm goes something like this:
1. Decide on the number of folds you want (k)
2. Subdivide your dataset into k folds
3. Use k-1 folds for a training set to build a tree.
4. Use the testing set to estimate statistics about the error in your tree.
5. Save your results for later
6. Repeat steps 3-5 k times, leaving out a different fold for your test set each time.
7. Average the errors across your iterations to predict the overall error
The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal error (although that doesn't make it optimal, just that it performed best on the fold it was given - maybe using stratification will help, but everything I've read says it only helps a little bit).
As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across k trees when each tree could vary in what it chooses to split on, etc.?
What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.
[/quote]
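For readers who want to see the quoted procedure as code, here is a minimal sketch, assuming Python with scikit-learn and a placeholder dataset (RapidMiner's X-Validation operator implements the same loop graphically). It makes the quoted poster's observation visible: each fold grows its own tree, and only the averaged error is kept.
[code]
# Minimal sketch of the quoted k-fold procedure (scikit-learn assumed;
# the iris data is just a placeholder).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=42)  # steps 1-2: choose k, split

fold_errors = []
for train_idx, test_idx in kf.split(X):
    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(X[train_idx], y[train_idx])              # step 3: train on k-1 folds
    accuracy = tree.score(X[test_idx], y[test_idx])   # step 4: test on the held-out fold
    fold_errors.append(1.0 - accuracy)                # step 5: save the result

print("estimated error:", np.mean(fold_errors))       # step 7: average across folds
# The k trees themselves are thrown away; only the error estimate is kept.
[/code]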
The posted answers don't really answer the question for me. First of all, to my understanding, one should not use the same data for training and testing. So using 100% of your data for training and then testing the model on the same data is a no-go. Secondly, when you put a decision tree learner in the left (training) part of a cross-validation operator, it should indeed create a (possibly different) model for each iteration. So the question remains: which tree is chosen in the end (the one you see when you choose to output the model)? Or is this in fact some kind of average model?
Or is the idea of X-Validation not to actually create a model, but rather to see how a certain learner with certain parameters would perform on your data? In that case, however, I don't understand how you would build the actual model. If you use 100% of your data for training, you would not have any test data left for post-pruning and prevention of overfitting ...
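If that second reading is correct (cross-validation only estimates how a learner with given parameters would perform), the usual practice is to report the cross-validated estimate and then train the deliverable model on all of the data. A minimal sketch, again assuming scikit-learn and a placeholder dataset rather than anything RapidMiner-specific:
[code]
# Sketch: cross-validation estimates the learner's performance; the model you
# actually keep is then trained on 100% of the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
learner = DecisionTreeClassifier(max_depth=4, random_state=0)

# 1) Estimate how this learner + parameter set performs on unseen data.
scores = cross_val_score(learner, X, y, cv=10)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# 2) Build the final model on all the data; the estimate above is the
#    performance figure you report for it.
final_model = learner.fit(X, y)
[/code]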
Thank you very much for shedding some light on this topic and best regards
Hanspeter
Answers
How are the validation records chosen in RapidMiner? Are they automatically selected from the training set? Is the assumption correct that validation methods such as X-Validation do not influence the post-pruning step at all? Is it possible to influence the ratio of training records to validation records?
What steps are suggested in building a "final model"?
1. First find the most suitable algorithm and parameters by using e.g. X-Validation.
2. Then use this algorithm and parameters on the entire data available to build the final model? (See the sketch below.)
Thank you very much for providing some insight into this.
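A sketch of that two-step workflow, assuming scikit-learn's GridSearchCV as a stand-in for a parameter optimization operator wrapped around an X-Validation: every candidate parameter set is scored by cross-validation, and the winning candidate is then retrained on the entire data set.
[code]
# Sketch of the "final model" workflow asked about above (scikit-learn assumed,
# placeholder data and parameter grid): choose parameters by cross-validation,
# then train the deliverable model on all the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10)  # step 1: cross-validate each candidate
search.fit(X, y)                          # refit=True (default) retrains the best
                                          # candidate on the entire data set
print("best parameters:", search.best_params_)
print("cross-validated accuracy of best candidate: %.3f" % search.best_score_)
final_model = search.best_estimator_      # the model you would actually deploy
[/code]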
Text is sometimes not the medium to make an explanation, so I'd urge you to pull up one of the many samples which contain validation, and then ask your questions with the graphical representation in front of you.

They don't; they get examples from the validator, and the validator uses the model they produce on other examples.

Yes, by the validator. It is not true that the validator partitions the training set further. Learners may further partition the training examples in order to produce a model, but this is internal and does not alter the example splitting done by the validator.

The validator selects the test examples. There are several validators in RapidMiner, each for different scenarios, and each with different selection methods and parameters...

Yes. Here's where the picture comes in: look at the inputs to the learner; it is just the training example set.

And much, much more, given the different validators and their parameters...

Validation and parameter optimisation are but two of several building blocks you will need to build your application, which you mention as being in the commercial field of mail-shotting. Would it not make sense to contact RM on a commercial basis, either for training or consultancy? Time is money, etc., etc. Just my 2c.
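The division of labour described above can be sketched as a small helper function, assuming plain Python with scikit-learn rather than RapidMiner internals: the validator owns the split, the learner only ever receives the training partition, and any internal partitioning a learner does (for pruning, say) happens inside that partition.
[code]
# Sketch of the validator/learner split described above (not RapidMiner code;
# the learner factory and iris data are placeholders).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def validate(make_learner, X, y, k=10):
    """Cross-validate any learner factory and return the mean accuracy."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        model = make_learner()
        # The learner sees only the training examples. If it needs an internal
        # hold-out (e.g. for pruning), it carves it out of this partition
        # itself; the validator's split is never altered.
        model.fit(X[train_idx], y[train_idx])
        # The validator applies the returned model to the held-out examples.
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))


X, y = load_iris(return_X_y=True)
print(validate(lambda: DecisionTreeClassifier(random_state=0), X, y))
[/code]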
Thank you very much for your elaborate answers!
The principle of using a testing dataset for validation to get a performance measure is rather clear to me. What I am not sure about is how model refinement is done. For example, post-pruning requires validation records from the training set, not the testing set; the testing set is then used to get the performance of the final model, after refinement. At least this is how it is described in "Data Mining Techniques and Applications" by Hongbo Du. So, since you can't define validation records, but only the test set, I don't understand how post-pruning works in RapidMiner. Does RapidMiner use the testing set for this? Does this mean you can't do any post-pruning without choosing a validator? Or is it rather as you suggested? Basically, are the following assumptions right?
- Post-pruning (e.g. reduced-error pruning) is not based on the testing set. Rather, the learner algorithm splits the training set internally and uses a part of it for model refinement. The user has no influence on that.
- The testing set / the validator have no influence on model building, but are solely used for measuring the performance of already refined models. The exception here is parameter optimization (and similar operators), which takes performance measures as input.
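A sketch of the first assumption, with the caveat that scikit-learn (used here as a stand-in) implements cost-complexity pruning rather than reduced-error pruning: the learner carves a pruning set out of the training partition it was given, and the outer test set is only used at the very end to measure performance.
[code]
# Sketch of assumption 1: pruning relies on an internal split of the *training*
# partition only. Cost-complexity pruning stands in for reduced-error pruning,
# and the breast-cancer data is just a placeholder.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The outer (validator's) split: the test set is reserved for performance only.
X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

# Inside the "learner": carve a pruning set out of the training partition.
X_grow, X_prune, y_grow, y_prune = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Grow a full tree, then pick the pruning strength that does best on the
# internal pruning set. The outer test set plays no role in this choice.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_grow, y_grow)
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_grow, y_grow)
    score = candidate.score(X_prune, y_prune)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("test accuracy (the validator's job):", pruned.score(X_test, y_test))
[/code]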
It would, and I have already considered it. The RM team (and its community members, including yourself) are extremely helpful, competent and friendly, so it is really tempting. The problem is: I am not doing this at work, but rather in my free time. At work, I use less elaborate techniques such as SQL scripts, R and Excel to get the information I need. In my free time, I read about data mining and try to apply what I learned in RapidMiner. In particular, I try to get better answers to the problems at work than with conventional techniques. So far, I have failed. Without results, my employer will never agree to buy an enterprise contract. We're just a small company and don't have that many problems that really call for data mining solutions. As a private person, the enterprise edition is way, way, way out of my reach, unfortunately. If I could, I would, that's for sure.
Best regards
Hanspeter
My point about the graphical advantage of the UI holds here as well. You can mouse over the flow, and hover over the input and output ports to see just what is being passed around: examples, models, performances and the rest. It is a pretty good way of checking that you understand the underlying flows and their sequences.
I'll try to follow your advice and use the GUI tools to get a deeper understanding of the underlying flows.
Best regards
Hanspeter