.-
help for ^regmsng^ (10May2009 Update of ^regmsng^ from STB-47: sg99)
.-
Multiple Regression with Missing Observations for Some Variables
----------------------------------------------------------------
^regmsng^ yvar xvarlist [weight] [^if^ exp]^,^ ^nm^iss^(^#^)^
[^fit^ted ^i^nst^(^varlist^)^ ^st^eps ^ch^ange ^no^hist ^no^test ^noa^ux *]
Description
-----------
This program's syntax is like that of ^regress^ or ^fit^. The first
argument is the dependent variable of a regression and subsequent
arguments are independent variables. The principal distinction between
^regmsng^ and these other estimation routines is behavior towards
missing values. Stata's standard approach is to omit an observation from
the regression if any of the left- or right-hand-side variables contains a
missing value for that observation. ^regmsng^ is just as finicky as
^regress^ with respect to the left-hand-side variable, but more tolerant
of missing observations on the right-hand-side.
Let ^nmiss^ be the maximum number of right-hand-side variables containing
a missing value on any observation in a data set. Standard practice in
^regress^, ^fit^ and other members of Stata's estimation family is to drop
the entire observation if ^nmiss^ > 0. ^regmsng^ instead allows the user to
set ^nmiss^ to an integer greater than zero. For example, if ^nmiss^ is
set to equal two, an observation would not be dropped unless three or more
right-hand-side variables are missing on that observation.
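The count that ^nmiss^ is compared against can be inspected by hand before
running the command. The following is a minimal sketch, not part of ^regmsng^
itself. It assumes the auto data shipped with Stata; the variable name nmis
and the three artificially blanked values of mpg are illustrative only.

```stata
* Illustration only: the auto data, the name nmis and the blanked values are assumptions
sysuse auto, clear
replace mpg = . in 1/3                   // create gaps in a second r.h.s. variable
egen nmis = rowmiss(mpg weight rep78)    // number of missing r.h.s. values per observation
tabulate nmis                            // observations by number of missing values
count if nmis <= 1 & !missing(price)     // observations that nmiss(1) would be expected to keep
```

Observations with nmis greater than the chosen ^nmiss^ value, or with a
missing dependent variable, are the ones that would be dropped.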
The most obvious approach to including observations that contain missing
values of some variables is to replace the missing values of the variables
with imputed values. ^regmsng^ takes this approach if the ^noaux^ option
is specified. However, the default approach is to replace the missing values
of the variable ^x^ with zeros and define an auxiliary variable, say ^Ax^,
which contains zeros where ^x^ is present and the imputed values where
^x^ is missing. This alternative allows the regression to estimate
separate regression coefficients for the original variable ^x^ and for the
new auxiliary variable ^Ax^. If the estimated coefficients of ^x^ and ^Ax^
differ, constraining them to be equal with the ^noaux^ option will damage
the efficiency of the estimates of the other coefficients.
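The two constructions described above can be reproduced by hand for a single
variable. This is a sketch under stated assumptions, not the program's own
code: it uses the auto data shipped with Stata, imputes with the mean of the
observed values over the whole data set, and the names Arep78 and rep78_imp
are chosen only for illustration.

```stata
* Illustration only: hand-built analogue of the default and noaux constructions
sysuse auto, clear
summarize rep78, meanonly                                   // mean of the observed values
generate Arep78    = cond(missing(rep78), r(mean), 0)       // imputed value where rep78 is missing, else 0
generate rep78_imp = cond(missing(rep78), r(mean), rep78)   // noaux analogue: imputed value in place
replace rep78 = 0 if missing(rep78)                         // zero-fill the original variable
regress price mpg rep78 Arep78                              // default: separate coefficients
regress price mpg rep78_imp                                 // noaux analogue: one common coefficient
```

Because rep78_imp equals rep78 + Arep78 after the zero-fill, the second
regression is the first one with the two coefficients constrained to be equal.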
After the estimation is performed, the program tests the hypothesis that
the coefficient of each zero-filled r.h.s. variable, ^x^, is equal to the
coefficient of its companion constructed variable, ^Ax^. Failure to reject
indicates that estimation with the ^noaux^ option would not damage the fit of
the regression. This in turn would suggest that the imputation of the
missing variables using the current value of ^nmiss^ does not greatly
alter the regression results, at least with respect to this variable.
Conversely, rejection of the hypothesis that the coefficient of ^x^ equals
the coefficient of ^Ax^ suggests caution in using the results for this value
of ^nmiss^ and this data set. Either the imputation is bad for this
variable (i.e. it is not producing accurate forecasts of the missing
values), or the coefficient of ^x^ differs between the subset of
observations for which ^x^ is observed and the subset for which it is not
observed.
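The comparison of the two coefficients can be imitated with Stata's ^test^
command after a hand-built version of the default regression. This sketch
assumes the auto data and mean imputation; whether ^regmsng^ computes its test
in exactly this way is not stated here, so treat it as an illustration of the
hypothesis rather than of the program's internals.

```stata
* Illustration only: Wald test that the coefficients on x and Ax are equal
sysuse auto, clear
summarize rep78, meanonly
generate Arep78 = cond(missing(rep78), r(mean), 0)   // auxiliary variable holding the imputed values
replace rep78 = 0 if missing(rep78)                  // zero-filled original variable
quietly regress price mpg rep78 Arep78
test rep78 = Arep78                                  // H0: the two coefficients are equal
display "p-value = " r(p)
```

A small p-value corresponds to the rejection case discussed above.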
On completion, ^regmsng^ restores the data to its original condition unless
the user specifies the ^change^ option. Specifying the ^change^ option
leaves behind only the observations with no more than ^nmiss^ missing
values among the specified RHS variables. By default, those missing values
are zero-filled and the new "A-variables" contain the imputed values for
those variables. If ^noaux^ is specified, the missing values are instead
replaced directly by the imputed values.
Issuing the ^regmsng^ command without a variable list or options causes
Stata to display the results of the last execution of ^regmsng^.
Restrictions:
1. This command is intended to apply only to multiple regression. Therefore,
there must be at least two right-hand-side variables.
2. The names of the right-hand-side variables must have no more than 30
characters (to allow creation of a new variable called Ax for any RHS
variable x with missing values). [In the original version of this program,
the limit was a maximum of 7 characters.]
3. The ^in^ option is not allowed.
4. All four types of weights are accepted and applied in the intermediate
calculations producing the means or the fitted values as well as in the final
regression. The variable or variables that define the weight are assumed
not to have any missing values. Use weights with caution.
5. The command makes no attempt to correct the estimated standard errors
of the coefficients for the fact that some of the data has been imputed.
Such a correction would typically increase the standard errors.
Options
-------
^nmiss^(^#^) Observations for which more than this number of right-hand-side
variables have missing data will be dropped from the regression.
^fitted^ Replace missing values with the fitted values from regressions,
a technique sometimes referred to as "first-order mean
replacement." Candidates for regressors in an auxiliary
regression to predict missing values of a r.h.s. variable
^x^ include the other r.h.s. variables in the original
regression plus the instrumental variables specified in the ^inst^
option, if any. Candidate variables are rejected, however,
if they are missing on any of the observations on which ^x^
is missing. If ^fitted^ is not specified, the default is to
replace the missing values by the means of those values (i.e.
"zero-order mean replacement").
^inst^(^varlist^) This option can only be specified with ^fitted^. It specifies
additional instrumental variables to use in predicting the
missing values of the r.h.s. variables, i.e. variables not
included among the r.h.s. variables of the regression. For
predicting any given r.h.s. variable, the only variables
used are those other r.h.s. variables or instruments that
are non-missing on every observation where the given
variable is missing.
^steps^ Causes the program to document the steps involved in imputing
the missing values. If ^fitted^ is specified, the first-stage
regressions are presented. Otherwise the variable means are
presented.
^change^ Allows the program to leave behind changed data and the new
variables. ^WARNING!^ This option destroys the data that were in
memory before the command was issued. Use this option carefully.
^nohist^ Suppresses the output of the tabulation of the number of
observations by the number of missing values.
^notest^ Suppresses the output of the tests that otherwise follow the
estimation.
^noaux^ Substitutes imputed values for missing values on the RHS. The
default (if this option is not specified) is to define an
auxiliary variable to contain the imputed values and to replace
the missing values of the RHS variable with zeros.
^*^ Other options are passed through to the ^regress^ command.
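The regression-based imputation selected by ^fitted^ and ^inst^ can be
imitated by hand for one variable. This is a sketch under stated assumptions
rather than the program's own code: it uses the auto data shipped with Stata,
treats length as a hypothetical extra instrument, and the names rep78_hat and
Arep78 are illustrative. The predictors mpg, weight and length have no missing
values in these data, so none of them would be rejected as a candidate.

```stata
* Illustration only: hand-built analogue of fitted-value imputation
sysuse auto, clear
quietly regress rep78 mpg weight length                // first stage, estimated where rep78 is observed
predict double rep78_hat, xb                           // fitted values for all observations
generate Arep78 = cond(missing(rep78), rep78_hat, 0)   // fitted value where rep78 is missing, else 0
replace rep78 = 0 if missing(rep78)                    // zero-fill the original variable
regress price mpg weight rep78 Arep78                  // final regression with the auxiliary variable
```

Replacing the fitted values with the mean of the observed values gives the
default "zero-order" behavior described under ^fitted^.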
Examples
--------
. ^regmsng y x1 x2 x3, nmiss(1)^
Estimate a regression of ^y^ on ^x1^, ^x2^ and ^x3^ including not only the
observations with no missing values, but also those that are complete except
for one of the three independent variables. In other words, drop observations
that are missing more than one of the three r.h.s. variables. Missing values
are replaced by means, which are used to construct auxiliary variables ^Ax1^,
^Ax2^ and ^Ax3^.
. ^regmsng y x1 x2 x3, nmiss(2) fitted inst(x4 t)^
Loosen the restriction to include observations that are missing either one
or two of the three r.h.s. variables. Estimate missing values by the
fitted values of regressions on the other r.h.s. variables plus two other
exogenous variables, ^x4^ and a date or time trend variable called ^t^.
. ^regmsng y x1 x2 x3, nmiss(2) fitted inst(x4 t) nohist notest^
Same as above, but suppressing some of the output.
. ^regmsng y x1 x2 x3, nmiss(3) fitted inst(x4 t) steps^
Same as the second example, but with ^nmiss^ raised to 3 and with the
^steps^ option documenting the first-stage regressions.
Author
------
Mead Over
World Bank
meadover@@CGDev.org
Also see
--------
STB: STB-47 sg99
Manual: ^[R] impute^
On-line: ^help^ for @impute@; ^lookup^ for ^pattern^