Multiple Regression Assumptions
No measurement error
The independent (X) and dependent (Y) variables are accurately measured. Measurement error in an independent variable will bias the estimates; the estimates may remain unbiased when random error is confined to the dependent variable. The consequences of random error in an independent variable:
· R^2 may be lower.
· Partial slope coefficients can vary dramatically depending on the amount of random error in the independent variables.
· The partial slope coefficients of independent variables that do not have random measurement error will be biased if they are correlated with another independent variable that does.
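The attenuation of a slope by random error in X can be seen in a small simulation (a hypothetical setup, not from the text; the variances and slope below are arbitrary choices):

```python
# Sketch: random measurement error added to an independent variable
# biases its estimated slope toward zero (attenuation bias).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x_true = rng.normal(0, 1, n)
y = 2.0 * x_true + rng.normal(0, 1, n)      # true slope = 2

x_noisy = x_true + rng.normal(0, 1, n)      # random measurement error in X

def slope(x, y):
    """OLS slope of y on a single regressor x (with intercept)."""
    x_c = x - x.mean()
    return (x_c @ (y - y.mean())) / (x_c @ x_c)

b_clean = slope(x_true, y)    # near 2.0
b_noisy = slope(x_noisy, y)   # near 1.0, i.e., attenuated toward zero
```

With equal variances of signal and noise, the expected attenuation factor is 1/2, so the estimated slope shrinks from about 2 to about 1.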
No specification error
The theoretical model is linear, additive, and includes the correct variables. Linearity implies that the average change in the dependent variable associated with a one-unit change in an independent variable is constant regardless of the level of that independent variable. If the partial slope for X is not constant across values of X, then X has a nonlinear relationship with Y, which results in biased partial slopes.
Nonlinearity
Not linear, but the slope does not change direction.
Correction: A log-log model takes a nonlinear specification in which the slope changes as the value of X increases and makes it linear for purposes of interpreting the parameter estimates. It accomplishes this by taking the log of the dependent variable and of all independent variables and replacing the original variables with the logged variables. The resulting coefficients are interpreted as the % change in Y given a 1% change in X.
Example: log Y = a + b1 log X1 + b2 log X2 + b3 log X3 + log e
Note: the data must have positive values. The log of zero or of a negative value is undefined, so such observations cannot be used, and dropping them can bias the model. Use the antilog of the coefficients to estimate Y.
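A minimal sketch of the log-log fit, assuming a constant-elasticity data-generating process Y = A·X^b (the constants below are invented for illustration):

```python
# Sketch: logging Y and X linearizes Y = A * X^b; the fitted slope is
# the elasticity (% change in Y for a 1% change in X), and the antilog
# of the intercept recovers the multiplicative constant A.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 100, 5000)                        # must be positive before logging
y = 3.0 * x ** 0.5 * rng.lognormal(0, 0.1, 5000)     # true elasticity = 0.5

# Fit log Y = log a + b log X by ordinary least squares.
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(log_a)        # antilog of the intercept estimates A
```

Here `b` lands near the true elasticity of 0.5, and `a` near the multiplicative constant 3.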
Not linear, and the slope changes direction (positive to negative or vice versa).
Correction: A polynomial model may be used to correct for changes in the sign of the slope (positive or negative). This is accomplished by adding variables that are incremental powers of the same independent variable to model bends in the slope.
Example: Y = a + b1X1 + b2X1^2 + b3X1^3 + e
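A brief simulated sketch of the cubic specification (hypothetical coefficients, numpy only):

```python
# Sketch: incremental powers of the SAME variable X1 let the fitted
# slope bend and change direction across the range of X1.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 5000)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.5 * x**3 + rng.normal(0, 0.5, 5000)

# Design matrix: intercept, X1, X1^2, X1^3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)   # [a, b1, b2, b3]
```

The recovered coefficients land near the true values (1, 2, -1.5, 0.5) used to generate the data.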
Nonadditivity
Additivity implies that the average change in the dependent variable associated with a one-unit change in an independent variable (X1) is constant regardless of the value of another independent variable (X2) in the model. If this assumption is violated, we can no longer interpret a slope as "holding other variables constant," since the values of the other variables may change the slope coefficient and therefore its interpretation. The figure above displays a nonadditive relationship when X1 is interval/ratio and X2 is a dummy variable. If the partial slope for X1 is not constant for differing values of X2, then X1 and X2 do not have an additive relationship with Y.
Correction: An interaction term may be added using a dummy variable where the slope of X1 is thought to depend on the value of a dummy variable X2. The final model will look like the following:
Model: Y = a + b1X1 + b2X1X2 + e
where X1X2 = the interaction between X1 and X2, or X1*X2
Interpretation:
b1 is interpreted as the slope for X1 when the dummy variable (X2) is 0
b1 + b2 is interpreted as the slope for X1 when the dummy variable (X2) is 1
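This interpretation can be checked with a small simulation (invented coefficients, numpy only; the model form matches the one above):

```python
# Sketch: with a dummy interaction Y = a + b1*X1 + b2*(X1*X2) + e,
# the slope of X1 is b1 when X2 = 0 and b1 + b2 when X2 = 1.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x1 = rng.normal(0, 1, n)
x2 = rng.integers(0, 2, n)                  # dummy variable (0 or 1)
y = 1.0 + 2.0 * x1 + 1.5 * x1 * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x1 * x2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

slope_when_0 = b1          # slope of X1 for the X2 = 0 group, near 2.0
slope_when_1 = b1 + b2     # slope of X1 for the X2 = 1 group, near 3.5
```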
Correction: A multiplicative model may be used with two interval-level independent variables that are thought to interact in how they influence Y.
Model without interaction term:
Y = a + b1X1 + b2X2 + e
Model with interaction term:
Y = a + b1X1 + b2X2 + b3X1X2 + e
where X1X2 (= X1*X2) is the interaction term and b3 is its coefficient
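In this multiplicative form the marginal effect of X1 on Y is b1 + b3·X2, so it shifts with the level of X2. A simulated sketch (hypothetical coefficients, numpy only):

```python
# Sketch: with two interval-level variables and an interaction term,
# the slope of X1 depends on the value of X2 (it equals b1 + b3*X2).
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
a, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

def effect_of_x1_at(x2_val):
    """Marginal effect of X1 on Y at a given value of X2."""
    return b1 + b3 * x2_val
```

For example, `effect_of_x1_at(2.0)` is near 2 + 0.5·2 = 3, while `effect_of_x1_at(0.0)` is near 2.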
Incorrect Independent Variables
Including the correct independent variables implies that no irrelevant variable has been included in the model and that all theoretically important relevant variables are included. Omitting relevant variables will bias the slope coefficients and may increase the likelihood of improperly finding statistical significance. Including irrelevant variables will make it more difficult to find statistical significance.
Correction: Remove irrelevant variables and, if possible, include missing relevant variables.
Mean of errors equals zero
When the mean error (reflected in the residuals) is not equal to zero, the Y intercept may be biased. Violation of this assumption will not affect the slope coefficients; the partial slope coefficients remain Best Linear Unbiased Estimates (BLUE).
Error term is normally distributed
The distribution of the error term closely reflects the distribution of the dependent variable: if the dependent variable is not normally distributed, the error term may not be normally distributed either. Violation of this assumption will not bias the partial slope coefficients but may affect significance tests.
Correction: Always correct other problems first, then reevaluate the residuals.
· If the distribution of residuals is skewed to the right (higher values), try using the natural log of the dependent variable.
· If the distribution of residuals is skewed to the left (lower values), try squaring the dependent variable.
Homoskedasticity
The variance of the error term is constant for all values of the independent variables. Heteroskedasticity occurs when the variance of the error term is not constant. The parameter estimates for the partial slopes and the intercept are not biased if this assumption is violated; however, the standard errors are biased, and hence significance tests may not be valid.
Diagnosis of Heteroskedasticity
Plot the regression residuals against the values of the independent variable(s). If the points form an even pattern about a horizontal axis, heteroskedasticity is unlikely. For small samples there may be some tapering at each end of the horizontal distribution. If there is a cone- or bow-tie-shaped pattern, heteroskedasticity is suspected.
Correction: If an excluded independent variable is suspected, including that variable in the model may correct the problem. Otherwise, it may be necessary to use generalized least squares (GLS) or weighted least squares (WLS) models to create coefficients that are BLUE.
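A minimal WLS sketch, under the assumed (hypothetical) variance structure Var(e_i) proportional to x_i^2; the data and weights below are invented for illustration:

```python
# Sketch: when the error standard deviation grows with x, weighting each
# observation by 1/Var(e_i) (here, dividing through by x) restores a
# constant error variance, giving efficient (BLUE) estimates.
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.uniform(1, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n) * x   # error sd proportional to x

X = np.column_stack([np.ones(n), x])
sw = 1.0 / x                                  # sqrt of weights: w_i = 1/x_i^2

# WLS as OLS on the weighted data (each row scaled by sqrt(w_i)).
a_wls, b_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```

Both OLS and WLS are unbiased here; the gain from WLS is efficiency, i.e., smaller sampling variability of the coefficients.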
No autocorrelation
The error terms are not correlated across observations. Violation of this assumption is most likely with time-series data, where the value of one observation is not completely independent of another. (Example: a simple two-year time series of the same individuals is likely to find that a person's income in year 2 is correlated with their income in the prior year.) If there is autocorrelation, the parameter estimates for the partial slopes and the intercept are not biased, but the standard errors are biased and hence significance tests may not be valid.
Diagnosis
Suspect autocorrelation with any time-series/longitudinal data. Use the Durbin-Watson (d) statistic:
d near 2 indicates no correlation between error terms
d = 0 indicates perfect positive correlation between error terms
d = 4 indicates perfect negative correlation between error terms
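The statistic is easy to compute directly from residuals; a sketch with simulated residual series (hypothetical data, numpy only):

```python
# Sketch: Durbin-Watson statistic d = sum of squared successive
# differences of the residuals divided by the sum of squared residuals.
# d near 2 -> no first-order autocorrelation; near 0 -> strong positive;
# near 4 -> strong negative.
import numpy as np

def durbin_watson(residuals):
    """Compute the Durbin-Watson d statistic for a residual series."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(6)
independent = rng.normal(0, 1, 10_000)            # uncorrelated errors
persistent = np.cumsum(rng.normal(0, 1, 10_000))  # random walk: strongly autocorrelated

d_ok = durbin_watson(independent)    # near 2
d_bad = durbin_watson(persistent)    # near 0
```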
Correction: Use
generalized least squares (GLS) or weighted least squares (WLS) models to create
coefficients that are BLUE.
Multicollinearity
The assumption of no multicollinearity is an issue only for multiple regression models. Multicollinearity occurs when one independent variable has a substantial linear relationship with another independent variable in the equation. It occurs to some extent in any model and is more a matter of the degree of collinearity than of whether it exists at all. Multicollinearity produces variability in the partial slope coefficients from one sample to the next, or when the model is changed slightly, and it increases the standard errors, which reduces the likelihood of finding statistical significance. The result of these two effects is an unbiased but very inefficient estimator.
Diagnosis
· Failing to find any variables statistically significant even though the F-statistic shows the model is significant.
· Dramatic changes in coefficients as independent variables are added to or deleted from the model.
· Examine covariation among the independent variables by calculating all possible bivariate Pearson correlation coefficients. Generally a high correlation coefficient (say .80 or greater) suggests a problem. This check is imperfect, since multicollinearity may not be reflected in a bivariate correlation matrix.
· Regress each independent variable on the other independent variables. If any of the R^2s are near 1.0, there is a high degree of multicollinearity.
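The auxiliary-regression check in the last bullet can be sketched as follows (simulated data; the variance inflation factor VIF = 1/(1 - R^2) is a standard restatement of the same R^2, not something from the text):

```python
# Sketch: regress each independent variable on the others; an R^2 near
# 1.0 (equivalently a very large VIF) signals multicollinearity.
import numpy as np

def r_squared(X_others, target):
    """R^2 from regressing `target` on `X_others` plus an intercept."""
    A = np.column_stack([np.ones(len(target)), X_others])
    fitted = A @ np.linalg.lstsq(A, target, rcond=None)[0]
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(7)
n = 2000
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)      # nearly a copy of x1: collinear
x3 = rng.normal(0, 1, n)              # unrelated variable

r2_x1 = r_squared(np.column_stack([x2, x3]), x1)   # near 1.0: problem
r2_x3 = r_squared(np.column_stack([x1, x2]), x3)   # near 0.0: fine
vif_x1 = 1 / (1 - r2_x1)                           # very large
```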
Correction:
· Increase the sample size to lower the standard errors. This doesn't always work and is often not feasible, since adding more cases is not a simple exercise in most studies.
· Combine two or more highly correlated variables into a single indicator of a concept.
· Delete one of the highly correlated variables. This may result in a poorly specified model.
· Leave the variables in the model and rely on the joint-hypothesis F-test to evaluate the significance of the model. This is especially useful if you suspect multicollinearity is causing most if not all of the independent variables to be nonsignificant.
