Linear Regression: error in calculation of tolerance
I am writing training materials for multiple regression, and the Linear Regression operator appears to be calculating tolerance incorrectly.
To illustrate, see the attached toy dataset. My process reads this data and uses Linear Regression to fit y = f(x1, x2, x3, x4). The model is then applied to the training data (just to keep things simple), and finally I use Performance to get R-squared. The result is:
| Attribute | Coefficient | Standard Error | Std. Coefficient | Tolerance | t-stat | p-value | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| X1 | 0.6099442233747938 | 0.097076731571145 | 0.8324180612316422 | 0.4913830335394965 | 6.283114537367604 | 1.4384283423596322E-4 | **** |
| X2 | -2.8474043342377822E-8 | 1.9598479705266512E-7 | -0.028568714232080603 | 0.40108726248304105 | 0.0 | 1.0 | |
| X3 | 0.178312419929975 | 0.0821213306746008 | 0.7990271382036194 | 0.4534020133333492 | 2.1713289161925995 | 0.05798784094691456 | * |
| X4 | -0.0010830494516547503 | 7.82512989580685E-4 | -0.49206399607097406 | 0.262094151203384 | -1.3840657804736376 | 0.19969313341637596 | |
| (Intercept) | -0.3277299280807463 | 0.161204140113176 | NaN | NaN | -2.033011855965102 | 0.07258034063737584 | * |
I cross-checked the results against Minitab; RapidMiner and Minitab agree on everything except tolerance. Minitab reports VIFs, but these are simply the reciprocals of tolerance. Here is the Minitab output:
| Term | Coef | SE Coef | T-Value | P-Value | VIF |
| --- | --- | --- | --- | --- | --- |
| Constant | -0.328 | 0.161 | -2.03 | 0.073 | |
| x1 | 0.6099 | 0.0971 | 6.28 | 0.000 | 2.53 |
| x2 | -0.000000 | 0.000000 | -0.15 | 0.888 | 5.58 |
| x3 | 0.1783 | 0.0821 | 2.17 | 0.058 | 19.54 |
| x4 | -0.001083 | 0.000783 | -1.38 | 0.200 | 18.24 |
The VIFs are a long way from the reciprocals of the tolerances.
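To make the mismatch concrete, here is the arithmetic in R (tolerance values copied from the RapidMiner output above, rounded to four places):

```r
# Reciprocals of RapidMiner's reported tolerances vs Minitab's VIFs
tol_rm <- c(X1 = 0.4914, X2 = 0.4011, X3 = 0.4534, X4 = 0.2621)
round(1 / tol_rm, 2)
#   X1   X2   X3   X4
# 2.04 2.49 2.21 3.82   <- Minitab reports 2.53, 5.58, 19.54, 18.24
```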
I calculated the values directly: tolerance = 1 - R-sq, where R-sq is obtained by regressing each x against all the other xs. So, for example, if I drop y, make x4 the label, and re-run the process, I get an R-sq of 94.5%; the tolerance for x4 should therefore be 1 - 0.945 = 0.055, not 0.262.
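For reference, this check can be scripted directly in R (a minimal sketch, assuming the toy data is loaded into a data frame called trial with columns Y, X1..X4, as in the R session quoted later in the thread):

```r
# Tolerance of X4 by direct calculation:
# regress X4 on the other predictors, then tolerance = 1 - R-squared
aux <- lm(X4 ~ X1 + X2 + X3, data = trial)
rsq <- summary(aux)$r.squared  # R-squared of the auxiliary regression
1 - rsq                        # should give ~0.055 per the run described above
1 / (1 - rsq)                  # the corresponding VIF
```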
Am I going wrong, or is it an error?
Many thanks
David Hampton
Answers
Hey David,
I've dived into the code and saw no real issue except for possible numeric instabilities. Did you try normalizing first and comparing the results?
~Martin
Dortmund, Germany
Or were there any other parameters modified (e.g. ridge regression value) that might be affecting the calculation?
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Many thanks for your prompt reply, Martin. I have checked this: normalizing changes all the coefficients and their standard errors, as you would expect, but it does not affect the tolerances (or the p-values, for that matter), so that is not the cause. The normalized output is below, and a quick R check of this scale-invariance follows the table.
| Attribute | Coefficient | Standard Error | Std. Coefficient | Tolerance | t-stat | p-value | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| X1 | 1.8298325346724833 | 0.29123019471343986 | 0.8324179996125617 | 0.4913830514888129 | 6.283114072264977 | 1.438429138445052E-4 | **** |
| X2 | -0.04666048376439675 | 0.321161266854193 | -0.028568669272726246 | 0.40108725714143556 | -0.14528677203649398 | 0.8876862107223876 | |
| X3 | 1.2481866266339015 | 0.5748493147222091 | 0.7990269379160296 | 0.4534020060669054 | 2.1713283719179115 | 0.05798789235186819 | * |
| X4 | -0.9021798318881504 | 0.6518333203207141 | -0.4920637989900229 | 0.2620941472452639 | -1.3840652261290682 | 0.1996932973423048 | |
| (Intercept) | 0.3846904324227236 | 0.1089314893504385 | NaN | NaN | 3.531489697943573 | 0.0063989680350855505 | *** |
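As a quick check of that scale-invariance point, here is a sketch in R (assuming the same trial data frame and the car package used later in the thread):

```r
# VIFs (and hence tolerances) are unchanged when the predictors are rescaled
library(car)
trial_z <- trial
trial_z[c("X1", "X2", "X3", "X4")] <- scale(trial_z[c("X1", "X2", "X3", "X4")])
vif(lm(Y ~ ., data = trial_z))  # same VIFs as for the unscaled model
```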
A simple check to see whether something is indeed wrong is to calculate the tolerance directly: I re-ran the regression without y, making x4 the label instead. This directly gives the R-squared of x4 against all the other attributes. I get an R-squared of 0.954, so the tolerance of X4 should be 1 - 0.954 = 0.046 ... a long way from the figure RapidMiner gives, 0.262.
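The same auxiliary-regression check extends to all four predictors; a sketch (again assuming the trial data frame):

```r
# Tolerance of every predictor via its auxiliary regression on the others
xs <- c("X1", "X2", "X3", "X4")
sapply(xs, function(v) {
  aux <- lm(reformulate(setdiff(xs, v), response = v), data = trial)
  1 - summary(aux)$r.squared
})
```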
Thanks for your patience with this...
David
Thanks Brian
For training purposes I begin with no feature selection, no elimination of collinear features, and no regularisation. Adding in either feature selection or removal of collinear features sweeps away some of the xs and so masks the problem with the tolerance calculation (but doesn't solve it!). Adding regularisation makes only a very small difference: even with a ridge of 0.1 the tolerances reduce by only about 15-20%, and they remain several times too big, so it's not that.
cheers
David
David,
I've checked the code, which I attach here. It looks fine. I know that our LinReg has been benchmarked a lot against e.g. R and came out well. Did you compare it to some other tool, and are you sure about your VIF interpretation? Maybe @DArnu can help; he has some background here.
~Martin
Dortmund, Germany
Many thanks Martin.
I have checked using R with the car package to get the VIFs. The coefficients stack up exactly with RapidMiner's, and R gives the same VIFs as Minitab (i.e., contradicting RapidMiner).
Here's my R output:
```
> summary(book1Model)
Call:
lm(formula = Y ~ ., data = trial)
Residuals:
     Min       1Q   Median       3Q      Max
-0.18858 -0.03629 -0.01287  0.02995  0.38796
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.277e-01  1.612e-01  -2.033 0.072580 .
X1           6.099e-01  9.708e-02   6.283 0.000144 ***
X2          -2.847e-08  1.960e-07  -0.145 0.887686
X3           1.783e-01  8.212e-02   2.171 0.057988 .
X4          -1.083e-03  7.825e-04  -1.384 0.199693
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1671 on 9 degrees of freedom
Multiple R-squared: 0.9376, Adjusted R-squared: 0.9099
F-statistic: 33.82 on 4 and 9 DF, p-value: 1.973e-05
> vif(book1Model)
       X1        X2        X3        X4
 2.532610  5.579088 19.539216 18.237488
```
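For completeness, the session above was produced along these lines (a sketch: the file name is an assumption; the model and data frame names match the output):

```r
library(car)                       # provides vif()
trial <- read.csv("toy_data.csv")  # hypothetical name for the attached toy dataset
book1Model <- lm(Y ~ ., data = trial)
summary(book1Model)
vif(book1Model)
```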
So, assuming that RapidMiner's code is OK, something must be wrong with my Linear Regression operator. I deleted and replaced it; no change.
For clarity, the parameter settings I am using are:
- Feature selection: none
- Do not eliminate collinear features
- Use bias
- Ridge: 0
I believe these settings should give output equivalent to R and Minitab, yet I still get the same error. I must be doing something wrong, but I feel I have pretty much exhausted the possibilities!
thanks
David