Applied Statistics Handbook


 


 

Multiple Regression Assumptions

 

No measurement error

 

Independent (X) and dependent (Y) variables are assumed to be accurately measured.  For an independent variable, any measurement error will bias the estimates; for the dependent variable, estimates may remain unbiased if the error is random.  The consequences of random error in an independent variable include a lower R2, partial slope coefficients that can vary dramatically depending on the amount of random error in the independent variables, and bias in the partial slope coefficients of error-free independent variables that are correlated with another independent variable that does have measurement error. 

 

 

No specification error

 

The theoretical model is linear, additive, and includes the correct variables.  Linear implies that the average change in the dependent variable associated with a one-unit change in an independent variable is constant regardless of the level of the independent variable.  If the partial slope for X is not constant for differing values of X, X has a nonlinear relationship with Y, which results in biased partial slopes. 

 

Non-Linearity

 

Not linear, but the slope does not change direction.

 

Correction: A log-log model takes a nonlinear specification in which the slope changes as the value of X increases and makes it linear in terms of interpreting the parameter estimates.  It accomplishes this by taking the log of the dependent variable and of all independent variables and replacing the original variables with the logged variables.  The resulting coefficients are interpreted as the percent change in Y given a 1% change in X.

 

 

Example:   log Y = a + b1 log X1 + b2 log X2 + b3 log X3 + log e 

 

Note:  the data must have positive values; the log of zero or of a negative value is undefined.  To estimate Y on its original scale, take the antilog of the predicted (logged) values.
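A minimal sketch of a log-log fit, assuming Python with numpy and statsmodels is available; the variable names (x1, x2, x3, y) and the simulated data are illustrative only, not part of the handbook.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(1, 10, (3, 200))            # predictors must be positive
y = 2.0 * x1**0.5 * x2**0.3 * x3**0.2 * rng.lognormal(0, 0.1, 200)

# Replace every variable with its log and fit by ordinary least squares
X = sm.add_constant(np.column_stack([np.log(x1), np.log(x2), np.log(x3)]))
fit = sm.OLS(np.log(y), X).fit()

print(fit.params)                  # slopes read as % change in Y per 1% change in X
y_hat = np.exp(fit.fittedvalues)   # antilog of predicted values returns Y's original scale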

 

 

Not linear, and the slope changes direction (from positive to negative or vice versa)

 


Correction: A polynomial model may be used to correct for changes in the sign of the slope coefficient (positive or negative).  This is accomplished by adding variables that are incremental powers of the independent variable to model bends in the slope. 

 

Example:         Y = a + b1X1 + b2X1² + b3X1³ + e 
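A minimal sketch of a polynomial specification, again assuming numpy and statsmodels; x and y are simulated so that the slope changes direction.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 1, 200)   # slope changes direction

# Add incremental powers of the same independent variable
X = sm.add_constant(np.column_stack([x, x**2, x**3]))
fit = sm.OLS(y, X).fit()
print(fit.params)        # b1, b2, b3 model the bends in the slope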

 

 

Non-Additivity

 

Additive implies that the average change in the dependent variable associated with a one-unit change in an independent variable (X1) is constant regardless of the value of another independent variable (X2) in the model.  If this assumption is violated, we can no longer interpret the slope as "holding other variables constant," since the values of the other variables may change the slope coefficient and therefore its interpretation.

 


 

The figure above displays a non-additive relationship when (X1) is interval/ratio and (X2) is a dummy variable.  If the partial slope for (X1) is not constant for differing values of (X2), (X1) and (X2) do not have an additive relationship with Y. 

 

Correction:  An interaction term may be added when the slope of X1 is thought to depend on the value of a dummy variable X2.  The final model will look like the following:

 

Model:            Y = a + b1X1 + b2X1X2 + e  

 

where             X1X2 = the interaction between X1 and X2 (i.e., X1*X2)

 

 

Interpretation:

 

b1 is interpreted as the slope for X1 when the dummy variable (X2) is 0

 

b1 + b2 is interpreted as the slope for X1 when the dummy variable (X2) is 1
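A minimal sketch of the dummy-variable interaction above, assuming numpy and statsmodels; x1, x2, and the data are simulated for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, 200)
x2 = rng.integers(0, 2, 200)                       # dummy variable (0/1)
y = 1.0 + 0.5 * x1 + 1.5 * x1 * x2 + rng.normal(0, 1, 200)

# Y = a + b1*X1 + b2*(X1*X2) + e
X = sm.add_constant(np.column_stack([x1, x1 * x2]))
a, b1, b2 = sm.OLS(y, X).fit().params
print("slope for X1 when X2 = 0:", b1)
print("slope for X1 when X2 = 1:", b1 + b2)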

 

Correction: Use a multiplicative model for two interval-level independent variables

 

Used with two interval level independent variables that are thought to interact in how they influence Y.

 

Model without interaction term         Y = a + b1X1 + b2X2 + e              

 

Model with interaction term              Y = a + b1X1 + b2X2 + b3X1X2 + e  

 

where               X1X2 = X1*X2 is the interaction term and b3 is its coefficient
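A minimal sketch of the multiplicative model for two interval-level predictors, assuming numpy and statsmodels; the data are simulated so that X1 and X2 genuinely interact.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1 = rng.uniform(0, 10, 300)
x2 = rng.uniform(0, 10, 300)
y = 1.0 + 0.4 * x1 + 0.6 * x2 + 0.2 * x1 * x2 + rng.normal(0, 1, 300)

X_no_int = sm.add_constant(np.column_stack([x1, x2]))            # Y = a + b1X1 + b2X2 + e
X_int    = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))   # adds the b3X1X2 term
fit_int  = sm.OLS(y, X_int).fit()

print(fit_int.params)        # last coefficient is b3, the interaction term
print(fit_int.pvalues[-1])   # a significant b3 suggests the relationship is non-additive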

 

 

Incorrect Independent Variables

 

Including the correct independent variables implies that no irrelevant variables have been included in the model and that all theoretically important relevant variables are included.  Omitting relevant variables will bias the slope coefficients and may increase the likelihood of improperly finding statistical significance.  Including irrelevant variables inflates standard errors and makes it more difficult to find statistical significance.

 

Correction:  Remove irrelevant variables and, if possible, include missing relevant variables.

 

 

Mean of errors equals zero

 

When the mean error (reflected in the residuals) is not equal to zero, the y-intercept may be biased.  Violation of this assumption will not affect the slope coefficients; the partial slope coefficients will remain best linear unbiased estimates (BLUE). 

 

 

Error term is normally distributed

 

The distribution of the error term closely reflects the distribution of the dependent variable.  If the dependent variable is not normally distributed, the error term may not be normally distributed.  Violation of this assumption will not bias the partial slope coefficients but may affect significance tests. 

 

Correction:  Always correct other problems first and then re-evaluate the residuals.

         If the distribution of residuals is skewed to the right (higher values), try using the natural log of the dependent variable (see the sketch below).

         If the distribution of residuals is skewed to the left (lower values), try squaring the dependent variable.
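A minimal sketch of checking residual skew and applying the transformations suggested above, assuming numpy, scipy, and statsmodels; the simulated y is right-skewed and must be positive for the log transform.

import numpy as np
import statsmodels.api as sm
from scipy.stats import skew

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 300)
y = np.exp(0.2 + 0.3 * x + rng.normal(0, 0.4, 300))   # right-skewed, positive outcome

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
print("residual skew:", skew(resid))

if skew(resid) > 0:
    refit = sm.OLS(np.log(y), X).fit()    # right skew: log the dependent variable
else:
    refit = sm.OLS(y**2, X).fit()         # left skew: square the dependent variable
print("residual skew after transform:", skew(refit.resid))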

 

 

Homoskedasticity

 

The variance of the error term is constant for all values of the independent variables.  Heteroskedasticity occurs when the error term does not have constant variance.  The parameter estimates for the partial slopes and the intercept are not biased if this assumption is violated; however, the standard errors are biased, and hence significance tests may not be valid.

 

 

 

Diagnosis Of Heteroskedasticity

 

Plot the regression residuals against the values of the independent variable(s).  If the points form an even pattern about a horizontal axis, heteroskedasticity is unlikely. 

 

 

For small samples there may be some tapering at each end of the horizontal distribution. 

 

 

If there is a cone- or bow-tie-shaped pattern, heteroskedasticity is suspected.

 

 

Correction:  If an excluded independent variable is suspected, including this variable in the model may correct the problem.  Otherwise, it may be necessary to use generalized least squares (GLS) or weighted least squares (WLS) models to create coefficients that are BLUE. 
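A minimal sketch of the residual plot and a weighted least squares re-fit, assuming numpy, matplotlib, and statsmodels; the 1/x**2 weighting is illustrative (it matches how the variance grows in this simulated example), not a general prescription.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 * x, 300)    # error variance grows with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()

plt.scatter(x, ols_fit.resid)                      # cone-shaped spread suggests heteroskedasticity
plt.axhline(0, color="gray")
plt.xlabel("x"); plt.ylabel("residual")
plt.show()

wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()   # weights proportional to 1/variance
print(wls_fit.params)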

 


 

No autocorrelation

 

The error terms are not correlated across observations.  Violation of this assumption is likely to be a problem with time-series data where the value of one observation is not completely independent of another observation.  (Example: A simple two-year time series of the same individuals is likely to find that a person's income in year 2 is correlated with their income in the prior year.)  If there is autocorrelation, the parameter estimates for partial slopes and the intercept are not biased but the standard errors are biased and hence significance tests may not be valid.

 

Diagnosis

 

Suspect autocorrelation with any time series/longitudinal data

Use the Durbin-Watson (d) statistic (a computation sketch follows the reference values below)

 

d = 2 indicates no correlation between error terms

d = 0 indicates perfect positive correlation between error terms

d = 4 indicates perfect negative correlation between error terms
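A minimal sketch of computing d on regression residuals, assuming numpy and statsmodels (which provides durbin_watson); the positively autocorrelated series is simulated.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
n = 200
x = np.arange(n, dtype=float)
e = np.zeros(n)
for t in range(1, n):                        # build positively autocorrelated errors
    e[t] = 0.8 * e[t - 1] + rng.normal(0, 1)
y = 1.0 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))   # near 2 = no autocorrelation; toward 0 = positive; toward 4 = negative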

 

Correction:  Use generalized least squares (GLS) or weighted least squares (WLS) models to create coefficients that are BLUE. 

 

 

Multicollinearity

 

The assumption of no multicollinearity is an issue only for multiple regression models.  Multicollinearity occurs when one of the independent variables has a substantial linear relationship with another independent variable in the equation.  It occurs to some extent in any model and is more a matter of the degree of collinearity than of whether it exists at all.  Multicollinearity results in partial slope coefficients that vary from one sample to the next or when the model is changed slightly.  Standard errors are inflated, which reduces the likelihood of finding statistical significance.  The result of these two effects is an estimator that is unbiased but very inefficient.

 

Diagnosis

 

         No individual variables are statistically significant even though the F-statistic shows the overall model is significant.

         Dramatic changes in coefficients as independent variables are added to or deleted from the model.

         Examine covariation among the independent variables by calculating all possible bivariate Pearson correlation coefficients.  Generally a high correlation coefficient (say .80 or greater) suggests a problem.  This check is imperfect, since multicollinearity may not be reflected in a bivariate correlation matrix.

         Regress each independent variable on the other independent variables.  If any of the R2s are near 1.0, there is a high degree of multicollinearity (see the sketch following this list).
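A minimal sketch of the auxiliary-regression check in the last item, assuming numpy and statsmodels; the predictors are simulated so that x3 is nearly a linear combination of x1 and x2.  The variance inflation factor, VIF = 1 / (1 - R2), is an equivalent way to read the same result.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.normal(0, 1, 300)
x2 = rng.normal(0, 1, 300)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(0, 0.05, 300)   # x3 is highly collinear with x1 and x2
X = np.column_stack([x1, x2, x3])

# Regress each independent variable on the others and inspect the auxiliary R-squared
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    print(f"x{j + 1}: auxiliary R2 = {r2:.3f}, VIF = {1 / (1 - r2):.1f}")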

 

 

Correction: 

         Increase the sample size to lower the standard errors.  This does not always work and is often not feasible, since adding more cases is not a simple exercise in most studies.

         Combine two or more variables that are highly correlated into a single indicator of a concept.

         Delete one of the variables that are highly correlated.  This may result in a poorly specified model.

         Leave the variables in the model and rely on the joint-hypothesis F-test to evaluate the significance of the model.  This is especially useful if you suspect multicollinearity is causing most, if not all, of the independent variables to be non-significant.

 



 


Copyright 2015, AcaStat Software. All Rights Reserved.