"Unexpected Regression Performance Using Cross-Validation"

ikhwan · August 2010

Hi all,

I tried to do SVM Regression using LibSVM. When I measure the performance of this learner without cross validation (using the whole data as training set), it gives the following results:

absolute_error: 8618.717 +/- 19520.661
relative_error: 102.25% +/- 631.07%
correlation: 0.873
prediction_average: 35706.987 +/- 42654.440

However, when I add 10-fold cross validation in the workflow, I got really different result:

absolute_error: 28596.955 +/- 3938.106 (mikro: 28591.849 +/- 30064.573)
relative_error: 395.80% +/- 192.38% (mikro: 395.36% +/- 1,329.27%)
correlation: 0.320 +/- 0.126 (mikro: 0.303)
prediction_average: 35707.687 +/- 5282.379 (mikro: 35706.987 +/- 42654.440)

Is it normal to face this kind of situation, especailly when we use SVM Regression?
Is there any way to improve this performance?

FYI, the dataset consists of around 500 instances with 80 attributes. Originally it only has 6 attributes, two of them are textual and I converted to WordVector (TF-IDF) and the rest are nominal which are converted into binary.

For the learner, I use epsilon-SVR with gamma = 1.0 and C = 100000.0. Those parameter are the results of Optimization process.

Thanks in advance.

Cheers,
Ikhwan

This is the XML file for the cross-validation:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="449" width="614">
      <operator activated="true" class="read_arff" compatibility="5.0.8" expanded="true" height="60" name="Read ARFF" width="90" x="66" y="115">
        <parameter key="data_file" value="/home/kosumo/example-set-jboss.arff"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="246" y="30">
        <parameter key="name" value="timespent_sec"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.8" expanded="true" height="112" name="Validation" width="90" x="380" y="165">
        <parameter key="parallelize_training" value="true"/>
        <parameter key="parallelize_testing" value="true"/>
        <process expanded="true" height="408" width="276">
          <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.8" expanded="true" height="76" name="SVM" width="90" x="112" y="75">
            <parameter key="svm_type" value="epsilon-SVR"/>
            <parameter key="gamma" value="1.0"/>
            <parameter key="C" value="100000.0"/>
            <list key="class_weights"/>
          </operator>
          <connect from_port="training" to_op="SVM" to_port="training set"/>
          <connect from_op="SVM" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="408" width="279">
          <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="45" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="5.0.8" expanded="true" height="76" name="Performance" width="90" x="45" y="210">
            <parameter key="absolute_error" value="true"/>
            <parameter key="relative_error" value="true"/>
            <parameter key="correlation" value="true"/>
            <parameter key="prediction_average" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read ARFF" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 2"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

haddock · August 2010

Hi there,

A question, when you say...

For the learner, I use epsilon-SVR with gamma = 1.0 and C = 100000.0. Those parameter are the results of Optimization process.

How did you do the optimisation, and on what data? The reason I ask is that overtraining with SVMs is a well-known pitfall, and this issue keeps popping up.

ikhwan · August 2010

Many thanks for your reply. Yeah.. since I only have limited data, I use the same data for optimization process.
Do you have any suggestion for this situation? Should I split my data, but how much should I give for optimization?

For optimization, I just follow one workflow discussed previously in the forum. This is the XML file:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Root">
    <description>&lt;p&gt; Often the different operators have many parameters and it is not clear which parameter values are best for the learning task at hand. The parameter optimization operator helps to find an optimal parameter set for the used operators. &lt;/p&gt;  &lt;p&gt; The inner crossvalidation estimates the performance for each parameter set. In this process two parameters of the SVM are tuned. The result can be plotted in 3D (using gnuplot) or in color mode. &lt;/p&gt;  &lt;p&gt; Try the following: &lt;ul&gt; &lt;li&gt;Start the process. The result is the best parameter set and the performance which was achieved with this parameter set.&lt;/li&gt; &lt;li&gt;Edit the parameter list of the ParameterOptimization operator to find another parameter set.&lt;/li&gt; &lt;/ul&gt; &lt;/p&gt; </description>
    <process expanded="true" height="408" width="603">
      <operator activated="true" class="read_arff" compatibility="5.0.8" expanded="true" height="60" name="Read ARFF" width="90" x="45" y="75">
        <parameter key="data_file" value="/home/kosumo/postgre-SVM-wordToVector-noCharacters.arff"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="179" y="120">
        <parameter key="name" value="timespent_sec"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="XValidation" width="90" x="313" y="75">
        <parameter key="parallelize_training" value="true"/>
        <parameter key="parallelize_testing" value="true"/>
        <process expanded="true" height="408" width="280">
          <operator activated="true" class="optimize_parameters_grid" compatibility="5.0.0" expanded="true" height="112" name="loopThroughLocalParams" width="90" x="45" y="30">
            <list key="parameters">
              <parameter key="trainWithLocalParams.C" value="1024,32768,1048576,33554432,1073741824"/>
              <parameter key="trainWithLocalParams.gamma" value="0.00000095367431640625,0.0000019073486328125,0.000003814697265625,0.00000762939453125,0.0000152587890625,0.000030517578125,0.0009765625,0.03125,1,16384"/>
            </list>
            <parameter key="parallelize_optimization_process" value="true"/>
            <process expanded="true" height="390" width="585">
              <operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="crossEvalLocalParams" width="90" x="45" y="30">
                <parameter key="number_of_validations" value="2"/>
                <parameter key="parallelize_training" value="true"/>
                <parameter key="parallelize_testing" value="true"/>
                <process expanded="true" height="408" width="276">
                  <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.0" expanded="true" height="76" name="trainWithLocalParams" width="90" x="45" y="30">
                    <parameter key="svm_type" value="epsilon-SVR"/>
                    <parameter key="gamma" value="0.0000019073486328125"/>
                    <parameter key="C" value="32768"/>
                    <parameter key="epsilon" value="0.01"/>
                    <list key="class_weights"/>
                  </operator>
                  <connect from_port="training" to_op="trainWithLocalParams" to_port="training set"/>
                  <connect from_op="trainWithLocalParams" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true" height="408" width="279">
                  <operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Test" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance_regression" compatibility="5.0.8" expanded="true" height="76" name="Performance (2)" width="90" x="112" y="165">
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="relative_error" value="true"/>
                    <parameter key="correlation" value="true"/>
                    <parameter key="prediction_average" value="true"/>
                  </operator>
                  <connect from_port="model" to_op="Test" to_port="model"/>
                  <connect from_port="test set" to_op="Test" to_port="unlabelled data"/>
                  <connect from_op="Test" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
                  <connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="input 1" to_op="crossEvalLocalParams" to_port="training"/>
              <connect from_op="crossEvalLocalParams" from_port="training" to_port="result 1"/>
              <connect from_op="crossEvalLocalParams" from_port="averagable 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_parameters" compatibility="5.0.0" expanded="true" height="60" name="ParameterSetter" width="90" x="179" y="30">
            <list key="name_map">
              <parameter key="trainWithLocalParams" value="optimizedLearner"/>
            </list>
          </operator>
          <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.0" expanded="true" height="76" name="optimizedLearner" width="90" x="112" y="255">
            <parameter key="svm_type" value="epsilon-SVR"/>
            <parameter key="gamma" value="0.03125"/>
            <parameter key="C" value="1048576"/>
            <list key="class_weights"/>
          </operator>
          <connect from_port="training" to_op="loopThroughLocalParams" to_port="input 1"/>
          <connect from_op="loopThroughLocalParams" from_port="parameter" to_op="ParameterSetter" to_port="parameter set"/>
          <connect from_op="loopThroughLocalParams" from_port="result 1" to_op="optimizedLearner" to_port="training set"/>
          <connect from_op="optimizedLearner" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="408" width="279">
          <operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="applyModel" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="store" compatibility="5.0.8" expanded="true" height="60" name="Store" width="90" x="179" y="30">
            <parameter key="repository_entry" value="SVM_output_model"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="5.0.8" expanded="true" height="76" name="Performance" width="90" x="112" y="165">
            <parameter key="absolute_error" value="true"/>
            <parameter key="relative_error" value="true"/>
            <parameter key="correlation" value="true"/>
            <parameter key="prediction_average" value="true"/>
          </operator>
          <operator activated="true" class="store" compatibility="5.0.8" expanded="true" height="60" name="Store (2)" width="90" x="149" y="277">
            <parameter key="repository_entry" value="SVM_prediction"/>
          </operator>
          <connect from_port="model" to_op="applyModel" to_port="model"/>
          <connect from_port="test set" to_op="applyModel" to_port="unlabelled data"/>
          <connect from_op="applyModel" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="applyModel" from_port="model" to_op="Store" to_port="input"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <connect from_op="Performance" from_port="example set" to_op="Store (2)" to_port="input"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read ARFF" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="XValidation" to_port="training"/>
      <connect from_op="XValidation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Unexpected Regression Performance Using Cross-Validation"

Answers