X-Validation runs training X+1 times

spoi · May 2013

Hi,

I have used the X-Validation operator quite often over the last few days, with the number of validations set to 10.
But as I can see from the status bar (see image below), the operator in the training section of the X-Validation operator is not executed 10 times, as I would expect, but one time more: 11 times.

The 11th run of the training operator needs roughly the same time as the other training runs.

Even worse: if I use the "X-Validation (Parallel)" operator and allow 32 threads (I have 32 cores), the first 10 runs are executed in parallel, but the 11th run waits for those 10 runs to finish and only starts after that. This doubles the execution time.

My question is now: what is this 11th run for? Is this a bug or a feature? Is there any way to speed up the process, e.g. by running the 11th run in parallel with the other 10 runs?

Regards

Nils_Woehler · May 2013

Hi,

No, it is not a bug, it is a feature. :-)
After the X-Validation has run k times, it is run a (k+1)-th time to create a model on the complete example set provided. This model is delivered at the Validation.model port.
As the last training is done on the complete data set, it can in fact take quite a long time.
Unfortunately, it is currently not possible to skip the last modeling phase. But I've created an internal ticket to start the last training only if the model port is connected.
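
As a rough sketch of the behaviour described here (plain Python, not RapidMiner code; the function names and the toy majority-class learner are made up for illustration), the training step is called k times for the folds and once more on the full data:

```python
from collections import Counter


def train_majority(rows):
    """Toy learner (hypothetical): predicts the most frequent training label."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def accuracy(model, rows):
    """Fraction of rows whose label equals the constant prediction."""
    return sum(1 for _, label in rows if label == model) / len(rows)


def x_validation(rows, k, train, evaluate):
    """Run k fold trainings for the estimate, plus one training on all rows."""
    calls = 0
    scores = []
    for fold in range(k):
        test = rows[fold::k]  # every k-th row is held out for testing
        training = [r for i, r in enumerate(rows) if i % k != fold]
        model = train(training)
        calls += 1
        scores.append(evaluate(model, test))
    final_model = train(rows)  # the (k+1)-th run, on the complete data
    calls += 1
    return sum(scores) / k, final_model, calls


rows = [(i, "a" if i % 3 else "b") for i in range(30)]
estimate, final_model, calls = x_validation(rows, 10, train_majority, accuracy)
print(calls)  # 11 training calls for k = 10
```

With k = 10 this counts 11 training calls, which matches the number seen in the status bar.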

Best,
Nils

spoi · May 2013

Hi Nils,

thanks for your reply.

In addition to the ticket you created, it would be cool if the (k+1)-th training could be started in parallel with the other k trainings (if there are enough threads).
Or perhaps it is somehow possible to take the k-th model after testing, continue training it on the test data it was evaluated on, and use the resulting model as the final result.
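
To picture the first request, here is a hedged Python sketch (not something the operator does, as far as this thread says): the job on the full data is queued together with the k fold jobs, so with enough workers no job has to wait for another. `train` and `build_jobs` are hypothetical stand-ins, and the row counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def train(job):
    """Stand-in for an expensive learner; returns the job name and training size."""
    name, training_rows = job
    return name, len(training_rows)


def build_jobs(rows, k):
    """Create k fold jobs plus one job on all rows; the jobs are independent."""
    jobs = [
        (f"fold {f + 1}", [r for i, r in enumerate(rows) if i % k != f])
        for f in range(k)
    ]
    jobs.append(("full data", rows))  # the (k+1)-th job, queued with the rest
    return jobs


rows = list(range(150))
jobs = build_jobs(rows, 10)

# 32 workers mirror the thread count mentioned in the question.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(train, jobs))  # map keeps the submission order

print(len(results), results[-1])
```

For a CPU-bound learner in Python a process pool would probably be the more realistic choice; threads just keep the sketch short.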

awchisholm · May 2013

Hello

For fun I created the following process that "rolls its own" X-Validation, which you may be able to use to get the parallel execution you need (I haven't confirmed this last point, since I don't have a powerful enough machine to try it on).

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation" width="90" x="179" y="30">
        <description>A cross-validation evaluating a decision tree model.</description>
        <parameter key="sampling_type" value="linear sampling"/>
        <process expanded="true">
          <operator activated="true" class="materialize_data" compatibility="5.3.008" expanded="true" height="76" name="Materialize Data (3)" width="90" x="45" y="30"/>
          <operator activated="true" class="default_model" compatibility="5.3.008" expanded="true" height="76" name="Default Model" width="90" x="179" y="30">
            <parameter key="method" value="attribute"/>
            <parameter key="attribute_name" value="label"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="5.3.008" expanded="true" height="76" name="Generate Macro" width="90" x="179" y="165">
            <list key="function_descriptions">
              <parameter key="loopCounter" value="%{a}"/>
            </list>
          </operator>
          <operator activated="true" class="remember" compatibility="5.3.008" expanded="true" height="60" name="Remember" width="90" x="179" y="255">
            <parameter key="name" value="&quot;%{loopCounter}&quot;_train"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <connect from_port="training" to_op="Materialize Data (3)" to_port="example set input"/>
          <connect from_op="Materialize Data (3)" from_port="example set output" to_op="Default Model" to_port="training set"/>
          <connect from_op="Default Model" from_port="model" to_port="model"/>
          <connect from_op="Default Model" from_port="exampleSet" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_op="Remember" to_port="store"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="materialize_data" compatibility="5.3.008" expanded="true" height="76" name="Materialize Data (4)" width="90" x="45" y="165"/>
          <operator activated="true" class="remember" compatibility="5.3.008" expanded="true" height="60" name="Remember (3)" width="90" x="45" y="255">
            <parameter key="name" value="&quot;%{loopCounter}&quot;_test"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance" width="90" x="313" y="30"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Materialize Data (4)" to_port="example set input"/>
          <connect from_op="Materialize Data (4)" from_port="example set output" to_op="Remember (3)" to_port="store"/>
          <connect from_op="Remember (3)" from_port="stored" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_macro" compatibility="5.3.008" expanded="true" height="76" name="Generate Macro (2)" width="90" x="179" y="165">
        <list key="function_descriptions">
          <parameter key="loopCounter" value="%{loopCounter}+1"/>
        </list>
      </operator>
      <operator activated="true" class="remember" compatibility="5.3.008" expanded="true" height="60" name="Remember (2)" width="90" x="179" y="255">
        <parameter key="name" value="&quot;%{loopCounter}&quot;_train"/>
        <parameter key="io_object" value="ExampleSet"/>
      </operator>
      <operator activated="true" class="remember" compatibility="5.3.008" expanded="true" height="60" name="Remember (4)" width="90" x="179" y="345">
        <parameter key="name" value="&quot;%{loopCounter}&quot;_test"/>
        <parameter key="io_object" value="ExampleSet"/>
      </operator>
      <operator activated="true" class="loop" compatibility="5.3.008" expanded="true" height="130" name="Loop" width="90" x="380" y="30">
        <parameter key="set_iteration_macro" value="true"/>
        <parameter key="iterations" value="%{loopCounter}"/>
        <process expanded="true">
          <operator activated="true" class="recall" compatibility="5.3.008" expanded="true" height="60" name="Recall (2)" width="90" x="45" y="345">
            <parameter key="name" value="&quot;%{iteration}&quot;_test"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <operator activated="true" class="recall" compatibility="5.3.008" expanded="true" height="60" name="Recall" width="90" x="45" y="30">
            <parameter key="name" value="&quot;%{iteration}&quot;_train"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <operator activated="true" class="neural_net" compatibility="5.3.008" expanded="true" height="76" name="Neural Net" width="90" x="179" y="30">
            <list key="hidden_layers"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (2)" width="90" x="313" y="345">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (2)" width="90" x="447" y="120"/>
          <operator activated="true" class="materialize_data" compatibility="5.3.008" expanded="true" height="76" name="Materialize Data" width="90" x="447" y="210"/>
          <operator activated="true" class="materialize_data" compatibility="5.3.008" expanded="true" height="76" name="Materialize Data (2)" width="90" x="447" y="30"/>
          <connect from_op="Recall (2)" from_port="result" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Recall" from_port="result" to_op="Neural Net" to_port="training set"/>
          <connect from_op="Neural Net" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Neural Net" from_port="exampleSet" to_op="Materialize Data (2)" to_port="example set input"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Apply Model (2)" from_port="model" to_port="output 4"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="output 2"/>
          <connect from_op="Performance (2)" from_port="example set" to_op="Materialize Data" to_port="example set input"/>
          <connect from_op="Materialize Data" from_port="example set output" to_port="output 3"/>
          <connect from_op="Materialize Data (2)" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
          <portSpacing port="sink_output 4" spacing="0"/>
          <portSpacing port="sink_output 5" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="training" to_op="Generate Macro (2)" to_port="through 1"/>
      <connect from_op="Generate Macro (2)" from_port="through 1" to_op="Remember (2)" to_port="store"/>
      <connect from_op="Remember (2)" from_port="stored" to_op="Remember (4)" to_port="store"/>
      <connect from_op="Remember (4)" from_port="stored" to_op="Loop" to_port="input 1"/>
      <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
      <connect from_op="Loop" from_port="output 2" to_port="result 2"/>
      <connect from_op="Loop" from_port="output 3" to_port="result 3"/>
      <connect from_op="Loop" from_port="output 4" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

The first part stores the training and test example sets from inside a normal X-Validation, which uses a very simple model so that there is no hold-up while the example sets are partitioned. In addition, an (N+1)th example set is created from the full data.

The second part uses a Loop operator to retrieve the training examples, build a model from them and then use the test examples to obtain a performance. It also builds a model on the entire data set from the (N+1)th example set and tests it on that same data (so the performance for this last run will be over-optimistic).

For 10-fold X-Validation there will be 11 entries in each collection returned. The average of the first 10 performances will be the same as the estimated performance from a normal X-Validation. The 11th model would be the one output by a normal X-Validation. The other 10 models are all different and could also be used, but it is generally better to use the model made from the most data, in this case the 11th.
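
The same bookkeeping can be sketched in plain Python (an illustration with a made-up toy regressor and made-up data, not a translation of the process above): store k (train, test) pairs plus one extra pair that uses the full data on both sides, loop over all of them, and average only the first k performances.

```python
from statistics import mean


def make_partitions(rows, k):
    """Return k (train, test) pairs plus a (k+1)-th pair using all rows twice."""
    pairs = []
    for fold in range(k):
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        pairs.append((train, test))
    pairs.append((rows, rows))  # last entry: trained and tested on the same data
    return pairs


def fit_mean(train):
    """Toy regressor (hypothetical): always predicts the mean training target."""
    return mean(train)


def mean_abs_error(model, test):
    """Average absolute difference between the targets and the prediction."""
    return mean(abs(y - model) for y in test)


rows = [float(v) for v in range(20)]
pairs = make_partitions(rows, 10)

# One model and one performance per stored pair, like the Loop operator.
models = [fit_mean(train) for train, _ in pairs]
errors = [mean_abs_error(m, test) for m, (_, test) in zip(models, pairs)]

cv_estimate = mean(errors[:-1])  # only the first k entries are held-out estimates
final_model = models[-1]         # the model built from the most data
```

The last error is left out of the average on purpose, because that model was tested on its own training data.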

You'll notice that I have to use the Materialize Data operator a lot. This is generally needed because, without it, the display of example sets can go wrong for reasons I can't explain.

It should be possible to run the Loop operator in the second part in parallel, and of course you can modify the process to do what you want.

regards

Andrew

fischer · May 2013

Hi,

Some misinformation here: the (k+1)-th run is executed only if the model output port is connected. Otherwise there will be only k runs, so at least the first ticket has already been closed in the meantime :-) Parallelizing the execution of the (k+1)-th run is still a valid feature request, though.

Best,
Simon
