LinearRegression vs W-LinearRegression
Why would I get different coefficients and a lower root mean squared error when using Weka's W-LinearRegression than I would with RM's native LinearRegression?
I have a set of data to which I've applied PCA, obtaining 9 principal components as input to the regression.
I'm using XValidationParallel with 20 validations and shuffled sampling.
Within the XVal node, I'm building either a LinearRegression or W-LinearRegression model, applying it, and measuring its performance. The average RMS error is the reported performance.
Both regression nodes have attribute selection turned off and are not trying to eliminate collinear features. The other parameters are at their default settings.
The results I'm getting are below. Note that the coefficients are different, as are the RMS error estimates.
I thought the two models would yield near-identical results, so I'm confused about what's causing the difference, and about whether I'd be better off using the Weka LinearRegression, since it yielded a lower error.
This is with RM 4.4.
W-LinearRegression
Linear Regression Model
5.5846 * pc_1 +
-1.757 * pc_2 +
-1.018 * pc_3 +
-1.3188 * pc_4 +
0.5875 * pc_5 +
-0.7379 * pc_6 +
3.8062 * pc_7 +
1.3037 * pc_8 +
0.5423 * pc_9 +
-39.8406
root_mean_squared_error: 17.360 +/- 0.512 (mikro: 17.367 +/- 0.000)
LinearRegression
3.547 * pc_1
- 0.473 * pc_2
- 1.579 * pc_3
- 1.314 * pc_4
- 1.693 * pc_5
- 0.131 * pc_6
- 0.111 * pc_7
- 1.802 * pc_8
- 1.016 * pc_9
- 41.004
root_mean_squared_error: 20.991 +/- 0.596 (mikro: 21.001 +/- 0.000)
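For anyone wanting to reproduce the comparison outside RapidMiner, a rough analogue of the setup above (PCA to 9 components, then a plain linear regression scored by RMSE over 20 shuffled folds) can be sketched in Python with scikit-learn; the data, attribute count, and seeds below are made up, so the numbers will not match the ones reported here.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up stand-in for the original attributes and label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)

# PCA down to 9 components, then ordinary least squares, evaluated with
# RMSE over 20 shuffled folds (analogous to XValidationParallel with
# 20 validations and shuffled sampling)
model = make_pipeline(PCA(n_components=9), LinearRegression())
cv = KFold(n_splits=20, shuffle=True, random_state=0)
rmse = -cross_val_score(model, X, y, cv=cv,
                        scoring="neg_root_mean_squared_error")
print("RMSE: %.3f +/- %.3f" % (rmse.mean(), rmse.std()))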
Answers
In the process below, I'm running both LinearRegression and W-LinearRegression on the same dataset.
After running the complete process you will see that the two models created are different.
If you disable the ChangeAttributeRole node and rerun the process, the models will be the same (and the PerformanceVectors will show the same RMS errors).
Incidentally, I ran into another peculiarity with RM -- apparently it doesn't retain an existing PerformanceVector object when another one is created. After the 2nd RegressionPerformance node is executed, the previous PerformanceVector object disappears. I wanted to retain both so the performance of each model could be seen. Instead, I've added a breakpoint after the first RegressionPerformance so the RMS error can be noted. Is this a bug?
Example XML:
Thanks for pointing this out. I checked the source code, did another test, and found out that the way we invoke Weka's linear regression actually does use the example weights - but Weka handles the weights differently from our implementation. You can see that yourself when you use the following process:
Here, the column "att6" is not used for learning but is completely ignored. The resulting performance is the same for both learners, namely 96.794. When you change the role operator to "weight" instead of "ignore", then both performances and models change, not only the RM model. This indicates that both models actually take the weights into account, but they do so in different ways. The main difference is that we use the Java Matrix (Jama) library for the underlying calculations, which might perform a bit differently; as far as I have seen, the weight handling itself is otherwise done in exactly the same way.
Cheers,
Ingo
P.S.: The process above makes use of the IO Storage mechanism to prevent the two performance vectors from being combined, which is actually a desired feature if you want to calculate several different performance measures with different operators and keep track of them in a single vector.
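To make the "ignore" versus "weight" observation above concrete, here is a small sketch on made-up data (plain Python/scikit-learn rather than the RM or Weka operators; the weight column here simply plays the role that "att6" plays in the process above):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: three regular attributes plus one column of example weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w = rng.uniform(0.1, 5.0, size=200)          # stand-in for the "att6" column
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Role "ignore": the weight column plays no part in the fit
ignored = LinearRegression().fit(X, y)
# Role "weight": the same column is used as per-example weights
weighted = LinearRegression().fit(X, y, sample_weight=w)

print("coefficients with weights ignored:", ignored.coef_)
print("coefficients with weights used:   ", weighted.coef_)
# The coefficient vectors differ, i.e. the weights really are being used.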
It never occurred to me that the weights were being used in both cases, but in different ways. I also tried running your example in RM 4.3, just to make sure it wasn't a new problem with 4.4, and both versions produced the same results. I searched Weka's documentation and mailing list archives and couldn't find a clear explanation of how they apply weights.
I am still a bit troubled by the different results. My understanding of ordinary least squares linear regression is that it is calculated analytically -- you will always get the same answer on the same data -- rather than numerically with iterative methods, which are not exact, keep iterating until an answer is "close enough", and could yield different results on different runs, perhaps with different random number seeds.
Weighted least squares linear regression, at least according to sources like http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd432.htm, should also yield identical, exact results on identical data. So I'm still puzzled why RM's LinearRegression and Weka's Linear Regression would yield different results with the same weighted example data. The differences intuitively seem too big to be explained by round-off errors from different matrix calculations. But my intuition isn't always accurate either. :-)
Any further ideas on what's going on?
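For reference, the weighted least squares solution the NIST handbook describes is a closed-form expression, so with identical data and weights it is completely deterministic; a minimal numpy sketch (toy data, not the RM or Weka implementation) looks like this:

import numpy as np

def wls(X, y, w):
    # beta = (X^T W X)^{-1} X^T W y, with an intercept column appended to X
    Xb = np.column_stack([X, np.ones(len(X))])
    W = np.diag(w)
    return np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
w = rng.uniform(0.5, 2.0, size=100)
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

# Running this twice on the same X, y, and w always gives the same coefficients
print(wls(X, y, w))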
The Excel code I found only seems to allow one X variable, so I modified the RM process to use just att1 to predict the label, with att6 as the weights. The results from Excel matched the Weka results, not the RM results. This leads me to think the bug might be on the RM side, although it's possible that both Weka and the author of the Excel code made the same error.
We will check this and write back as soon as we have more information.
Cheers,
Ingo
Good news: I found the reason for the difference. Before the linear regression is performed, both tools (RM and Weka) normalize the data to mean 0 and standard deviation 1. During this normalization, Weka takes the example weights into account, which RapidMiner does not - hence the difference. Although I am not sure whether this is always desired (probably yes), we decided to also perform a weighted normalization, just to keep things comparable with Weka (and also Excel).
We will then include this fix in the next update.
Cheers,
Ingo
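To illustrate the normalization step Ingo describes, here is a toy sketch of unweighted versus weight-aware standardization (hypothetical data, plain numpy; not the actual RM or Weka code). The two versions produce different standardized columns, which is where the two implementations diverged before the fix:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # one attribute column
w = rng.uniform(0.1, 5.0, size=100)            # example weights

# Unweighted statistics (RapidMiner's behaviour before the fix)
mu_u, sd_u = x.mean(), x.std()
x_unweighted = (x - mu_u) / sd_u

# Weight-aware statistics (Weka's behaviour)
mu_w = np.average(x, weights=w)
sd_w = np.sqrt(np.average((x - mu_w) ** 2, weights=w))
x_weighted = (x - mu_w) / sd_w

print("unweighted mean/std:", mu_u, sd_u)
print("weighted   mean/std:", mu_w, sd_w)
# The regression is then fitted on the standardized data, so the two
# normalizations feed it different inputs for the same example set.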
Great news, Ingo, and thanks for the quick response and fix. I look forward to getting it in the next RM EE update.