LinearRegression vs W-LinearRegression
Why would I get different coefficients and a lower root mean squared error when using Weka's W-LinearRegression than I would with RM's native LinearRegression?
I have a set of data to which I've applied PCA, obtaining 9 principal components as input to the regression.
I'm using XValidationParallel with 20 validations and shuffled sampling.
Within the XVal node, I'm building either a LinearRegression or W-LinearRegression model, applying it, and measuring its performance. The average RMS error is the reported performance.
Both regression nodes have attribute selection turned off and are not trying to eliminate collinear features. The other parameters are at their default settings.
The results I'm getting are below. Note that the coefficients are different, as are the RMS error estimates.
I thought the two models would yield near-identical results, so I'm confused about what's causing the difference, and about whether I'd be better off using the Weka LinearRegression, since it yielded a lower error.
This is with RM 4.4.
W-LinearRegression
Linear Regression Model
5.5846 * pc_1 +
-1.757 * pc_2 +
-1.018 * pc_3 +
-1.3188 * pc_4 +
0.5875 * pc_5 +
-0.7379 * pc_6 +
3.8062 * pc_7 +
1.3037 * pc_8 +
0.5423 * pc_9 +
-39.8406
root_mean_squared_error: 17.360 +/- 0.512 (mikro: 17.367 +/- 0.000)
LinearRegression
3.547 * pc_1
- 0.473 * pc_2
- 1.579 * pc_3
- 1.314 * pc_4
- 1.693 * pc_5
- 0.131 * pc_6
- 0.111 * pc_7
- 1.802 * pc_8
- 1.016 * pc_9
- 41.004
root_mean_squared_error: 20.991 +/- 0.596 (mikro: 21.001 +/- 0.000)
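For anyone wanting to reproduce the comparison outside RapidMiner, a rough analogue of the setup above (PCA to 9 components, then a plain linear regression scored by RMSE over 20 shuffled folds) can be sketched in Python with scikit-learn; the data, attribute count, and seeds below are made up, so the numbers will not match the ones reported here.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up stand-in for the original attributes and label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)

# PCA down to 9 components, then ordinary least squares, evaluated with
# RMSE over 20 shuffled folds (analogous to XValidationParallel with
# 20 validations and shuffled sampling)
model = make_pipeline(PCA(n_components=9), LinearRegression())
cv = KFold(n_splits=20, shuffle=True, random_state=0)
rmse = -cross_val_score(model, X, y, cv=cv,
                        scoring="neg_root_mean_squared_error")
print("RMSE: %.3f +/- %.3f" % (rmse.mean(), rmse.std()))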
Answers
In the process below, I'm running both LinearRegression and W-LinearRegression on the same dataset.
After running the complete process you will see that the two models created are different.
If you disable the ChangeAttributeRole node and rerun the process, the models will be the same (and the PerformanceVectors will show the same RMS errors).
Incidentally, I ran into another peculiarity with RM -- apparently it doesn't retain an existing PerformanceVector object when another one is created. After the 2nd RegressionPerformance node is executed, the previous PerformanceVector object disappears. I wanted to retain both so the performance of each model could be seen. Instead, I've added a breakpoint after the first RegressionPerformance so the RMS error can be noted. Is this a bug?
Example XML:
Thanks for pointing this out. I checked the source code, did another test, and found out that the way we invoke Weka's linear regression actually does use the example weights - but Weka handles the weights differently from our implementation. You can see that yourself when you use the following process:
Here, the column "att6" is not used for learning but is completely ignored. The resulting performance is the same for both learners, namely 96.794. When you change the role operator to "weight" instead of "ignore", then both performances and models change, not only the RM model. This indicates that both models actually take the weights into account, but they do so in different ways. The main difference is that we use the Java Matrix (Jama) library for the underlying calculations, which might perform a bit differently; as far as I have seen, the weight handling itself is otherwise done in exactly the same way.
Cheers,
Ingo
P.S.: The process above makes use of the IO Storage mechanism to prevent the two performance vectors from being combined, which is actually a desired feature if you want to calculate several different performance measures with different operators and keep track of them in a single vector.
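To make the "ignore" versus "weight" observation above concrete, here is a small sketch on made-up data (plain Python/scikit-learn rather than the RM or Weka operators; the weight column here simply plays the role that "att6" plays in the process above):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: three regular attributes plus one column of example weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w = rng.uniform(0.1, 5.0, size=200)          # stand-in for the "att6" column
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Role "ignore": the weight column plays no part in the fit
ignored = LinearRegression().fit(X, y)
# Role "weight": the same column is used as per-example weights
weighted = LinearRegression().fit(X, y, sample_weight=w)

print("coefficients with weights ignored:", ignored.coef_)
print("coefficients with weights used:   ", weighted.coef_)
# The coefficient vectors differ, i.e. the weights really are being used.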
It never occurred to me that the weights were being used in both cases, but in different ways. I also tried running your example in RM 4.3, just to make sure it wasn't a new problem with 4.4, and both versions produced the same results. I searched Weka's documentation and mailing list archives and couldn't find a clear explanation of how they apply weights.
I am still a bit troubled by the different results. My understanding of ordinary least squares linear regression is that it is calculated analytically -- you will always get the same answer on the same data -- rather than numerically with iterative methods, which are not exact, keep iterating until an answer is "close enough", and could yield different results on different runs, perhaps with different random number seeds.
Weighted least squares linear regression, at least according to sources like http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd432.htm, should also yield identical, exact results on identical data. So I'm still puzzled why RM's LinearRegression and Weka's Linear Regression would yield different results with the same weighted example data. The differences intuitively seem too big to be explained by round-off errors from different matrix calculations. But my intuition isn't always accurate either. :-)
Any further ideas on what's going on?
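For reference, the weighted least squares solution the NIST handbook describes is a closed-form expression, so with identical data and weights it is completely deterministic; a minimal numpy sketch (toy data, not the RM or Weka implementation) looks like this:

import numpy as np

def wls(X, y, w):
    # beta = (X^T W X)^{-1} X^T W y, with an intercept column appended to X
    Xb = np.column_stack([X, np.ones(len(X))])
    W = np.diag(w)
    return np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
w = rng.uniform(0.5, 2.0, size=100)
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

# Running this twice on the same X, y, and w always gives the same coefficients
print(wls(X, y, w))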
The Excel code I found only seems to allow one X variable, so I modified the RM process to use just att1 to predict the label, with att6 as the weights. The results from Excel matched the Weka results, not the RM results. This leads me to think the bug might be on the RM side, although it's possible that both Weka and the author of the Excel code made the same error.
We will check this and write back as soon as we have more information.
Cheers,
Ingo
Good news: I found the reason for the difference. Before the linear regression is performed, both tools (RM and Weka) normalize the data to mean 0 and standard deviation 1. During this normalization, Weka takes the example weights into account, which RapidMiner does not - hence the difference. Although I am not sure whether this is always desired (probably yes), we decided to also perform a weighted normalization, just to keep things comparable with Weka (and also Excel).
We will then include this fix in the next update.
Cheers,
Ingo
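To illustrate the normalization step Ingo describes, here is a toy sketch of unweighted versus weight-aware standardization (hypothetical data, plain numpy; not the actual RM or Weka code). The two versions produce different standardized columns, which is where the two implementations diverged before the fix:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # one attribute column
w = rng.uniform(0.1, 5.0, size=100)            # example weights

# Unweighted statistics (RapidMiner's behaviour before the fix)
mu_u, sd_u = x.mean(), x.std()
x_unweighted = (x - mu_u) / sd_u

# Weight-aware statistics (Weka's behaviour)
mu_w = np.average(x, weights=w)
sd_w = np.sqrt(np.average((x - mu_w) ** 2, weights=w))
x_weighted = (x - mu_w) / sd_w

print("unweighted mean/std:", mu_u, sd_u)
print("weighted   mean/std:", mu_w, sd_w)
# The regression is then fitted on the standardized data, so the two
# normalizations feed it different inputs for the same example set.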
Great news, Ingo, and thanks for the quick response and fix. I look forward to getting it in the next RM EE update.