question marks in linear regression output

AD2019 · November 2019

I ran a linear regression model with 18 independent variables and feature selection turned off. For some of the independent variables there were question marks for the standard error of the estimate, and therefore also for the t-statistic and p-value of the coefficient. I ran the model again with feature selection turned on and got the same question marks. What do these question marks mean? They cannot have anything to do with missing values, as the regression would not have run to completion in that case. I am baffled about what these "?" symbols might mean. Help!

varunm1 · November 2019

Hello @sgenzer and @AD2019

I looked for H2O documentation on linear regression but unfortunately found none. For GLM, however, H2O recommends a mandatory set of parameter settings to get p-values without "?" (unknown):

1. You should uncheck the "Use Regularization" option.
2. You should select "Add intercept".
3. You should select "compute p-values".
4. You should select "remove collinear columns".

If these are set, you will get the p-values, standard errors, etc. without question marks. In that case you will get question marks only when the coefficient is 0.
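As a rough illustration of why collinear columns matter here: in ordinary least squares, the coefficient standard errors come from the inverse of X'X, and that inverse does not exist when one column is an exact multiple of another. The sketch below uses made-up data and a hypothetical helper, `gram_determinant`; it is not a description of what the H2O or RapidMiner operators do internally.

```python
def gram_determinant(x1, x2):
    """Determinant of X'X for a two-column design matrix without intercept."""
    a = sum(v * v for v in x1)               # x1 . x1
    b = sum(u * v for u, v in zip(x1, x2))   # x1 . x2
    d = sum(v * v for v in x2)               # x2 . x2
    return a * d - b * b


# Illustrative data only (not from the thread's data set).
x1 = [1, 2, 3, 4]
x2_collinear = [2, 4, 6, 8]     # exactly 2 * x1
x2_independent = [1, 0, 2, 1]   # not a multiple of x1

det_collinear = gram_determinant(x1, x2_collinear)      # zero: X'X is singular
det_independent = gram_determinant(x1, x2_independent)  # non-zero: X'X is invertible
```

A zero determinant means X'X cannot be inverted, so a standard error for those coefficients is not defined. Whether this is what happens in a particular process would still need to be checked against the data.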

I will see if I can find any information on linear regression.

sgenzer · November 2019

thank you @varunm1!

Telcontar120 · November 2019

Can you post your process xml? Do you have the bias parameter checked in the LR operator or the exclude collinear features? There are several options that can affect the output.

AD2019 · November 2019

Hi, I have attached my process .rmp file. 'Exclude collinear features' is unchecked, and you are correct about the bias setting: if 'use bias' is checked, I do not get question marks; if it is unchecked, I do. I did all this with 'feature selection' turned off. Something else is also strange. I then turned on feature selection and used T_Test as the selection method with alpha set to 0.05. I got a solution that included independent variables with p-values much higher than 0.05. I am confused about why these IVs were not trimmed from the output. Thanks in advance for your help.
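For reference, the expectation described above (keep only variables whose p-value is at or below alpha) amounts to a filter like the following. The variable names and p-values are invented for illustration, and `select_by_p_value` is a hypothetical helper. It does not describe how the T_Test selection in the operator is actually implemented.

```python
def select_by_p_value(p_values, alpha=0.05):
    """Return the names whose p-value is at or below alpha, in input order."""
    return [name for name, p in p_values.items() if p <= alpha]


# Hypothetical coefficients and p-values, not taken from the posted process.
example = {"sqft": 0.001, "age": 0.20, "rooms": 0.04}
kept = select_by_p_value(example)
```

If the operator's selection behaved like this one-pass filter, "age" would be dropped in the example. Output that still contains high p-values suggests the selection works differently, which is the open question in this post.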

AD2019 · November 2019

By the way, regardless of the cause, I would like to know what the question mark in the regression output is trying to communicate to the user. Does it mean a computational underflow or overflow, a computational error, or something else?

sgenzer · November 2019

hi @AD2019 I'm picking up this thread here. I have your process (thank you) but not the data set - hence I cannot run the process. Can you pls post?

AD2019 · November 2019

My apologies for the delay in posting the data file. Please see attached. When I run the regression without bias, I get question marks in the regression model. What does that mean? The process file was posted earlier (RM-houseprice-process.rmp).

sgenzer · November 2019

hi @AD2019 do you mean these ? marks?

Image: https://us.v-cdn.net/6030995/uploads/editor/xk/x92tv27br21t.png

So the simple answer is that ? marks are used in RapidMiner when values are missing. The better question is why are they missing...my educated guess here (pls correct me @varunm1 @mschmitz if my stats are wrong here) is that there can be no std coefficient or tolerance for an intercept of a LinReg model as it's a computed value. All of your actual data (the other attributes) have std coefficients which make sense. But my stats are a wee bit rusty so I look to these other smart folks to correct me.

Scott

AD2019 · November 2019

Hi Scott:

If you run the process with bias turned off, you will get question marks for some of the independent variables as well, not just the intercept. Since there is a question mark on the standard error for these variables, the t-statistic and p-value also have question marks. So it is not just an issue of the intercept. The data set does not have missing values, so I could not figure out what the question marks were trying to say. The only thing I could think of was numerical overflow or underflow when calculating the standard error of the associated variable, but then I could not see how the coefficients would have been computed.
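The dependency described above (no standard error, hence no t-statistic or p-value) can be sketched as follows. This is only an illustration of how a missing value propagates: `t_and_p` is a hypothetical helper, and the p-value uses a normal approximation to stay dependency-free, whereas a regression report would normally use a t distribution.

```python
import math


def t_and_p(coef, std_err):
    """Return (t, p) for one coefficient; both are NaN if the standard error is unusable."""
    if std_err is None or math.isnan(std_err) or std_err == 0:
        return math.nan, math.nan
    t = coef / std_err
    # Two-sided tail probability of the standard normal (approximation, see lead-in).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


t_ok, p_ok = t_and_p(2.0, 1.0)                  # usable standard error
t_missing, p_missing = t_and_p(2.0, math.nan)   # missing standard error
```

Note that the coefficient itself (2.0 here) is still available in the second call. Only the quantities derived from the standard error become missing, which matches the pattern of "?" entries described in this post.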

Amit

sgenzer · November 2019

hi Amit -

Ah I understand. Good point. It's been a while since I've played with all of this (we normally use the GLM modeler instead of LinReg as it is far more versatile and robust). Let me investigate.

Scott

AD2019 · November 2019

thanks Scott. Let me play around with GLM and see if I can get rid of the ?

AD2019 · November 2019

thank you Varun.
