.-
help for ^regmsng^          (10May2009 Update of ^regmsng^ from STB-47: sg99)
.-

Multiple Regression with Missing Observations for Some Variables
----------------------------------------------------------------

    ^regmsng^ yvar xvarlist [weight] [^if^ exp]^,^ ^nm^iss^(^#^)^ [^fit^ted
            ^i^nst^(^varlist^)^ ^st^eps ^ch^ange ^no^hist ^no^test ^noa^ux *]


Description
-----------

This program's syntax is like that of ^regress^ or ^fit^. The first argument
is the dependent variable of a regression and subsequent arguments are
independent variables.

The principal distinction between ^regmsng^ and these other estimation
routines is behavior towards missing values. Stata's standard approach is to
omit an observation from the regression if any of the left- or
right-hand-side variables contains a missing value for that observation.
^regmsng^ is just as finicky as ^regress^ with respect to the left-hand-side
variable, but more tolerant of missing observations on the right-hand side.

Let ^nmiss^ be the maximum number of right-hand-side variables that may be
missing on any observation retained in the regression. Standard practice in
^regress^, ^fit^ and other members of Stata's estimation family corresponds
to ^nmiss^ = 0: the entire observation is dropped if any variable is missing.
^regmsng^ instead allows the user to set ^nmiss^ to an integer greater than
zero. For example, if ^nmiss^ is set equal to two, an observation is not
dropped unless three or more right-hand-side variables are missing on that
observation.

The most obvious approach to including observations that contain missing
values of some variables is to replace the missing values of the variables
with imputed values. ^regmsng^ takes this approach if the ^noaux^ option is
specified. However, the default approach is to replace the missing values of
the variable ^x^ with zeros and define an auxiliary variable, say ^Ax^, which
contains zeros where ^x^ is present and the imputed values where ^x^ is
missing.
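
As a rough illustration of the default construction described above, the
commands below build the zero-filled variable and its auxiliary companion by
hand for a single regressor, using mean imputation. This is only a sketch and
not the code ^regmsng^ runs internally: the variable names y, x1 and x2 are
hypothetical, and only x1 is assumed to have missing values.

```stata
* Sketch only: y, x1 and x2 are hypothetical variables; x1 has missing values.
preserve

* Mean of the observed values of x1 (zero-order mean replacement).
summarize x1, meanonly
local xbar = r(mean)

* Auxiliary variable: zero where x1 is observed, imputed value where missing.
generate double Ax1 = 0
replace Ax1 = `xbar' if missing(x1)

* Zero-fill the original variable so the observation is no longer dropped.
replace x1 = 0 if missing(x1)

* Separate coefficients are estimated for x1 and Ax1.
regress y x1 Ax1 x2

* Hypothesis of equal coefficients on x1 and its companion Ax1.
test x1 = Ax1

restore
```

Ax1 is created before x1 is zero-filled because the missing() condition must
still see the original missing values. The ^preserve^/^restore^ pair leaves
the data in memory unchanged.
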
This alternative allows the regression to estimate separate regression
coefficients for the original variable ^x^ and for the new auxiliary variable
^Ax^. If the estimated coefficients of ^x^ and ^Ax^ differ, constraining them
to be equal with the ^noaux^ option will damage the efficiency of the
estimates of the other coefficients.

After the estimation is performed, the program tests the hypothesis that the
coefficient of each zero-filled r.h.s. variable, ^x^, is equal to the
coefficient of its companion constructed variable, ^Ax^. Failure to reject
indicates that estimation with the ^noaux^ option would not damage the fit of
the regression. This in turn would suggest that the imputation of the missing
variables using the current value of ^nmiss^ does not greatly alter the
regression results, at least with respect to this variable. Conversely,
rejection of the hypothesis that the coefficient of ^x^ equals the
coefficient of ^Ax^ suggests caution in using the results for this value of
^nmiss^ and this data set. Either the imputation is bad for this variable
(i.e. it is not producing accurate forecasts of the missing values), or the
coefficient of ^x^ differs between the subset of observations for which ^x^
is observed and the subset for which it is not observed.

On completion, ^regmsng^ restores the data to its original condition unless
the user specifies the ^change^ option. Specifying the ^change^ option leaves
behind only the observations on which no more than ^nmiss^ of the specified
RHS variables are missing. By default, the missing values are zero-filled and
the new "A-variables" contain the imputed values for those variables; if
^noaux^ is specified, the missing values are instead replaced directly by the
imputed values.

Issuing the ^regmsng^ command without a variable list or options causes Stata
to display the results of the last execution of ^regmsng^.

Restrictions:

  1. This command is intended to apply only to multiple regression.
     Therefore, there must be at least two right-hand-side variables.

  2. The names of the right-hand-side variables must have no more than 30
     characters (to allow creation of a new variable called Ax for any RHS
     variable x with missing values). [In the original version of this
     program, the limit was a maximum of 7 characters.]

  3. The ^in^ option is not allowed.

  4. All four types of weights are accepted and applied in the intermediate
     calculations producing the means or the fitted values as well as in the
     final regression. The variable or variables that define the weight are
     assumed not to have any missing values. Use weights with caution.

  5. The command makes no attempt to correct the estimated standard errors
     of the coefficients for the fact that some of the data has been
     imputed. Such a correction would typically increase the standard
     errors.


Options
-------

^nmiss^(^#^)
    Observations for which more than this number of right-hand-side
    variables have missing data will be dropped from the regression.

^fitted^
    Replace missing values with the fitted values from regressions, a
    technique sometimes referred to as "first-order mean replacement."
    Candidates for regressors in an auxiliary regression to predict missing
    values of a r.h.s. variable ^x^ include the other r.h.s. variables in
    the original regression plus the instrumental variables specified in the
    ^inst^ option, if any. Candidate variables are rejected, however, if
    they are missing on any of the observations on which ^x^ is missing.
    If ^fitted^ is not specified, the default is to replace the missing
    values by the means of those values (i.e. "zero-order mean
    replacement").

^inst^(^varlist^)
    This option can only be specified with ^fitted^. It specifies additional
    instrumental variables to use in predicting the missing values of the
    r.h.s. variables, i.e. variables not included among the list of r.h.s.
    variables for the regression. For predicting any given r.h.s. variable,
    the only variables used are those other r.h.s.
    variables or instruments which are present (non-missing) on ALL the
    observations where the given variable is missing.

^steps^
    Causes the program to document the steps involved in imputing the
    missing values. If ^fitted^ was chosen, the first-stage regressions are
    presented. Otherwise the variable means are presented.

^change^
    Allows the program to leave behind changed data and the new variables.
    ^WARNING!^ This option destroys the data that were in memory before the
    command was issued. Use this option carefully.

^nohist^
    Suppresses the output of the tabulation of the number of observations by
    the number of missing values.

^notest^
    Suppresses the output of the tests that otherwise follow the estimation.

^noaux^
    Substitutes imputed values for missing values on the RHS. The default
    (if this option is not specified) is to define an auxiliary variable to
    contain the imputed values and to replace the missing values of the RHS
    variable with zeros.

^*^
    Other options are passed through to the ^regress^ command.


Examples
--------

    . ^regmsng y x1 x2 x3, nmiss(1)^

Estimate a regression of ^y^ on ^x1^, ^x2^ and ^x3^ including not only the
observations with no missing values, but also those that are complete except
for one of the three independent variables. In other words, drop
observations that are missing more than one of the three r.h.s. variables.
Missing values are imputed by the variable means, which are placed in the
auxiliary variables ^Ax1^, ^Ax2^ and ^Ax3^ while the original variables are
zero-filled.

    . ^regmsng y x1 x2 x3, nmiss(2) fitted inst(x4 t)^

Loosen the restriction to include observations that are missing either one
or two of the three r.h.s. variables. Estimate missing values by the fitted
values of regressions on the other r.h.s. variables plus two other exogenous
variables, ^x4^ and a date or time trend variable called ^t^.

    . ^regmsng y x1 x2 x3, nmiss(2) fitted inst(x4 t) nohist notest^

Same as above, but suppressing some of the output.

    . ^regmsng y x1 x2 x3, nmiss(3) fitted inst(x4 t) steps^

As in the second example, but allowing up to three missing r.h.s. variables
and giving complete output, including the first-stage regressions.


Author
------

    Mead Over
    World Bank
    meadover@@CGDev.org


Also see
--------

    STB:  STB-47 sg99
 Manual:  ^[R] impute^
On-line:  ^help^ for @impute@; ^lookup^ for ^pattern^