Linear Regression: error in calculation of tolerance
I am writing training materials for multiple regression, and the Linear Regression operator appears to be calculating tolerance incorrectly.
To illustrate, see the attached toy dataset. My process reads this data and uses Linear Regression to fit y = f(x1, x2, x3, x4). The model is then applied to the training data (just to keep things simple), and finally I use Performance to get R-squared. The result is:
| Attribute | Coefficient | Standard Error | Std. Coefficient | Tolerance | t-stat | p-value | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| X1 | 0.6099442233747938 | 0.097076731571145 | 0.8324180612316422 | 0.4913830335394965 | 6.283114537367604 | 1.4384283423596322E-4 | **** |
| X2 | -2.8474043342377822E-8 | 1.9598479705266512E-7 | -0.028568714232080603 | 0.40108726248304105 | 0.0 | 1.0 | |
| X3 | 0.178312419929975 | 0.0821213306746008 | 0.7990271382036194 | 0.4534020133333492 | 2.1713289161925995 | 0.05798784094691456 | * |
| X4 | -0.0010830494516547503 | 7.82512989580685E-4 | -0.49206399607097406 | 0.262094151203384 | -1.3840657804736376 | 0.19969313341637596 | |
| (Intercept) | -0.3277299280807463 | 0.161204140113176 | NaN | NaN | -2.033011855965102 | 0.07258034063737584 | * |
I cross-checked the results against Minitab; RapidMiner and Minitab agree on everything except tolerance. Minitab reports VIFs, but these are simply the reciprocals of tolerance. Here is the Minitab output:
| Term | Coef | SE Coef | T-Value | P-Value | VIF |
| --- | --- | --- | --- | --- | --- |
| Constant | -0.328 | 0.161 | -2.03 | 0.073 | |
| x1 | 0.6099 | 0.0971 | 6.28 | 0.000 | 2.53 |
| x2 | -0.000000 | 0.000000 | -0.15 | 0.888 | 5.58 |
| x3 | 0.1783 | 0.0821 | 2.17 | 0.058 | 19.54 |
| x4 | -0.001083 | 0.000783 | -1.38 | 0.200 | 18.24 |
The VIFs are a long way from the reciprocals of the tolerances.
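To make the mismatch concrete, here is the arithmetic in R (tolerance values copied from the RapidMiner output above, rounded to four places):

```r
# Reciprocals of RapidMiner's reported tolerances vs Minitab's VIFs
tol_rm <- c(X1 = 0.4914, X2 = 0.4011, X3 = 0.4534, X4 = 0.2621)
round(1 / tol_rm, 2)
#   X1   X2   X3   X4
# 2.04 2.49 2.21 3.82   <- Minitab reports 2.53, 5.58, 19.54, 18.24
```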
I calculated the values directly: tolerance = 1 - R-sq, where R-sq is obtained by regressing each x against all the other xs. So, for example, if I drop y, make x4 the label, and re-run the process, I get an R-sq of 94.5%; the tolerance for x4 should therefore be 1 - 0.945 = 0.055, not 0.262.
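For reference, this check can be scripted directly in R (a minimal sketch, assuming the toy data is loaded into a data frame called trial with columns Y, X1..X4, as in the R session quoted later in the thread):

```r
# Tolerance of X4 by direct calculation:
# regress X4 on the other predictors, then tolerance = 1 - R-squared
aux <- lm(X4 ~ X1 + X2 + X3, data = trial)
rsq <- summary(aux)$r.squared  # R-squared of the auxiliary regression
1 - rsq                        # should give ~0.055 per the run described above
1 / (1 - rsq)                  # the corresponding VIF
```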
Am I going wrong, or is it an error?
Many thanks
David Hampton
Answers
Hey David,
I've dived into the code and saw no real issue except for possible numeric instabilities. Did you try normalizing first and comparing the results?
~Martin
Dortmund, Germany
Or were there any other parameters modified (e.g. ridge regression value) that might be affecting the calculation?
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Many thanks for your prompt reply, Martin. I have checked this: normalizing changes all the coefficients and their standard errors, as you would expect, but it does not affect the tolerances (or the p-values, for that matter), so that is not the cause. The normalized output is below, and a quick R check of this scale-invariance follows the table.
| Attribute | Coefficient | Standard Error | Std. Coefficient | Tolerance | t-stat | p-value | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| X1 | 1.8298325346724833 | 0.29123019471343986 | 0.8324179996125617 | 0.4913830514888129 | 6.283114072264977 | 1.438429138445052E-4 | **** |
| X2 | -0.04666048376439675 | 0.321161266854193 | -0.028568669272726246 | 0.40108725714143556 | -0.14528677203649398 | 0.8876862107223876 | |
| X3 | 1.2481866266339015 | 0.5748493147222091 | 0.7990269379160296 | 0.4534020060669054 | 2.1713283719179115 | 0.05798789235186819 | * |
| X4 | -0.9021798318881504 | 0.6518333203207141 | -0.4920637989900229 | 0.2620941472452639 | -1.3840652261290682 | 0.1996932973423048 | |
| (Intercept) | 0.3846904324227236 | 0.1089314893504385 | NaN | NaN | 3.531489697943573 | 0.0063989680350855505 | *** |
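As a quick check of that scale-invariance point, here is a sketch in R (assuming the same trial data frame and the car package used later in the thread):

```r
# VIFs (and hence tolerances) are unchanged when the predictors are rescaled
library(car)
trial_z <- trial
trial_z[c("X1", "X2", "X3", "X4")] <- scale(trial_z[c("X1", "X2", "X3", "X4")])
vif(lm(Y ~ ., data = trial_z))  # same VIFs as for the unscaled model
```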
A simple check to see whether something is indeed wrong is to calculate the tolerance directly: I re-ran the regression without y, making x4 the label instead. This directly gives the R-squared of x4 against all the other attributes. I get an R-squared of 0.954, so the tolerance of X4 should be 1 - 0.954 = 0.046 ... a long way from the figure RapidMiner gives, 0.262.
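The same auxiliary-regression check extends to all four predictors; a sketch (again assuming the trial data frame):

```r
# Tolerance of every predictor via its auxiliary regression on the others
xs <- c("X1", "X2", "X3", "X4")
sapply(xs, function(v) {
  aux <- lm(reformulate(setdiff(xs, v), response = v), data = trial)
  1 - summary(aux)$r.squared
})
```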
Thanks for your patience with this...
David
Thanks Brian
For training purposes I begin with no feature selection, no elimination of collinear features, and no regularisation. Adding in either feature selection or removal of collinear features sweeps away some of the xs and so masks the problem with the tolerance calculation (but doesn't solve it!). Adding regularisation makes only a very small difference: even with a ridge of 0.1 the tolerances reduce by only about 15-20%, and they remain several times too big, so it's not that.
cheers
David
David,
I've checked the code, which I attach here. It looks fine. I know that our LinReg has been benchmarked a lot against e.g. R and came out well. Did you compare it to some other tool, and are you sure about your VIF interpretation? Maybe @DArnu can help; he has some background here.
~Martin
Dortmund, Germany
Many thanks Martin.
I have checked using R with the car package to get the VIFs. The coefficients stack up exactly with RapidMiner's, and R gives the same VIFs as Minitab (i.e., contradicting RapidMiner).
Here's my R output:
```
> summary(book1Model)
Call:
lm(formula = Y ~ ., data = trial)
Residuals:
     Min       1Q   Median       3Q      Max
-0.18858 -0.03629 -0.01287  0.02995  0.38796
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.277e-01  1.612e-01  -2.033 0.072580 .
X1           6.099e-01  9.708e-02   6.283 0.000144 ***
X2          -2.847e-08  1.960e-07  -0.145 0.887686
X3           1.783e-01  8.212e-02   2.171 0.057988 .
X4          -1.083e-03  7.825e-04  -1.384 0.199693
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1671 on 9 degrees of freedom
Multiple R-squared: 0.9376, Adjusted R-squared: 0.9099
F-statistic: 33.82 on 4 and 9 DF, p-value: 1.973e-05
> vif(book1Model)
       X1        X2        X3        X4
 2.532610  5.579088 19.539216 18.237488
```
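For completeness, the session above was produced along these lines (a sketch: the file name is an assumption; the model and data frame names match the output):

```r
library(car)                       # provides vif()
trial <- read.csv("toy_data.csv")  # hypothetical name for the attached toy dataset
book1Model <- lm(Y ~ ., data = trial)
summary(book1Model)
vif(book1Model)
```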
So, assuming that RapidMiner's code is OK, something must be wrong with my Linear Regression operator. I deleted and replaced it; no change.
For clarity, the parameter settings I am using are:
- Feature selection: none
- Do not eliminate collinear features
- Use bias
- Ridge: 0
I believe these settings should give output equivalent to R and Minitab, yet I still get the same error. I must be doing something wrong, but I feel I have pretty much exhausted the possibilities!
thanks
David