PROBLEM 17: DUMMY VARIABLE PITFALLS
OBJECT OF THE PROBLEM
To familiarize you with some of the difficulties in using dummy
variables, including perfect multicollinearity and the so-called
"dummy variable trap".
INTRODUCTION
When using dummy variables, one must be certain to apply them
correctly. In particular, one must be careful not to include a
set of variables that add up to one for every observation while
the intercept term (a column of ones) is also in the model. This
leads to perfect multicollinearity and prevents the X'X matrix
from being inverted. It happens whenever every category of a
dummy variable is given its own 0-1 variable, a mistake that is
easy to make when there are more than two categories. One way to
avoid it is to run the model without the intercept term.
Alternatively, if several dummy variables are causing the
problem, one of them can be left out. The intercept then takes
the place of the coefficient for the omitted dummy variable, and
the remaining dummy coefficients measure differences from that
omitted group.
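The equivalence of the two remedies can be checked with simple arithmetic. The sketch below is illustrative only: the wage figures are invented, only three groups are shown, and it relies on the fact that when group dummies are the only regressors, least squares reproduces the group means.

```python
# Hypothetical wages for three groups (invented numbers, not the course data).
wages = {
    "PROF": [30.0, 34.0],
    "SKIL": [20.0, 22.0, 24.0],
    "UNSKL": [12.0, 14.0],
}


def mean(values):
    return sum(values) / len(values)


# Remedy 1: no intercept, one dummy per group.
# With dummies as the only regressors, each coefficient is that group's mean.
noint_coefs = {group: mean(w) for group, w in wages.items()}

# Remedy 2: keep the intercept and omit one dummy (SKIL is an arbitrary choice).
# The intercept is the omitted group's mean; every other coefficient is
# that group's mean minus the omitted group's mean.
omitted = "SKIL"
intercept = noint_coefs[omitted]
shift_coefs = {g: m - intercept for g, m in noint_coefs.items() if g != omitted}

# Both parameterizations give the same fitted wage for every group.
fitted = {g: intercept + shift_coefs.get(g, 0.0) for g in noint_coefs}
for group in noint_coefs:
    print(group, noint_coefs[group], fitted[group])
```

The two sets of coefficients differ, but the fitted values do not. Only the interpretation of each coefficient changes.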
Besides the simple dummy variable, which implies a simple shift
in the intercept term, one can also adjust the slope by
multiplying the dummy by the continuous independent variables in
the model. Including both the dummy and these products allows for
both slope and intercept changes. Alternatively, we could allow
only slope changes by omitting the 0-1 coded variables and
keeping just the products.
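In notation, with D a 0-1 dummy and X a single continuous regressor, the three specifications can be written as below. The symbols are chosen here for illustration and are not taken from the course materials.

```latex
\begin{align*}
\text{Intercept adjustment:} \quad
  & Y_i = \alpha + \delta D_i + \beta X_i + \varepsilon_i \\
\text{Slope adjustment:} \quad
  & Y_i = \alpha + \beta X_i + \gamma (D_i X_i) + \varepsilon_i \\
\text{Slope-intercept adjustment:} \quad
  & Y_i = \alpha + \delta D_i + \beta X_i + \gamma (D_i X_i) + \varepsilon_i
\end{align*}
```

In the last specification the group with D = 1 has intercept alpha + delta and slope beta + gamma, while the group with D = 0 has intercept alpha and slope beta.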
In this exercise you will use several different dummy variable
techniques: the slope adjustment model, the intercept adjustment
model, and the slope-intercept adjustment model.
The exercise uses the job-worker characteristics data, which are
stored on disk. The groups PROF, SKIL, UNSKL and UNEM are defined
to be mutually exclusive and exhaustive. In other words,
PROF + SKIL + UNSKL + UNEM = 1 for every observation. The data
should be familiar to you. The models try to explain wage using
the continuous variables education, experience, and age, and the
dummy variables for sex and profession.
PROBLEM PROGRAM
The following program can be used in the problem:
//PROB17 JOB (AF,A592),yourname,REGION=4096K
/*OPENBIN
//S1 EXEC SAS
//CRAYPO DD DSN=AU5920.ECON592.DATA(WORKER),DISP=SHR
//SYSIN DD *
DATA WOLFSON; INFILE CRAYPO;
INPUT WORK $ 1 EDUC 2-3 EXPER 4-5 AGE 6-7 SEX $ 8 WAGE 14-17;
* consider the following four groups to be mutually exclusive and
  exhaustive;
PROF=0; IF WORK='P' THEN PROF=1;
UNSKL=0; IF WORK='N' THEN UNSKL=1;
UNEM=0; IF WORK='U' THEN UNEM=1;
SKIL=0; IF WORK='S' THEN SKIL=1;
MALE=0; IF SEX='M' THEN MALE=1;
FEMALE=0; IF SEX='F' THEN FEMALE=1;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL SKIL EDUC EXPER AGE;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL SKIL EDUC EXPER AGE/NOINT;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL EDUC EXPER AGE;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL MALE EXPER EDUC AGE;
PROC GLM;
MODEL WAGE=PROF*MALE UNEM*MALE UNSKL*MALE SKIL*MALE PROF
UNEM UNSKL SKIL EXPER EDUC AGE/NOINT;
PROC GLM;
MODEL WAGE=EXPER EDUC MALE MALE*EXPER MALE*EDUC MALE*AGE
AGE;
//
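For readers who want to trace the DATA step by hand, the column positions in the INPUT statement and the dummy definitions can be mirrored in a few lines of Python. This is a sketch under stated assumptions: the function name and the sample record are invented for illustration and are not taken from the WORKER data set.

```python
def read_worker(record):
    """Parse one fixed-width record using the INPUT statement's column positions."""
    row = {
        "WORK": record[0],             # column 1
        "EDUC": float(record[1:3]),    # columns 2-3
        "EXPER": float(record[3:5]),   # columns 4-5
        "AGE": float(record[5:7]),     # columns 6-7
        "SEX": record[7],              # column 8
        "WAGE": float(record[13:17]),  # columns 14-17
    }
    # 0-1 dummies, built the same way as the IF ... THEN statements.
    for name, code in (("PROF", "P"), ("UNSKL", "N"), ("UNEM", "U"), ("SKIL", "S")):
        row[name] = 1 if row["WORK"] == code else 0
    row["MALE"] = 1 if row["SEX"] == "M" else 0
    row["FEMALE"] = 1 if row["SEX"] == "F" else 0
    return row


# Hypothetical record, invented for illustration (columns 9-13 left blank).
sample = "P161038M     2500"
worker = read_worker(sample)
print(worker)
```

For any record with a valid WORK code, exactly one of the four group dummies equals 1, which is what makes the groups mutually exclusive and exhaustive.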
DISCUSSION OF THE PROGRAM
The first three models are examples of various intercept
adjustment models. The fourth model has dummy variables for two
different effects, sex and profession, as opposed to just
profession in the first three. Note that the intercept is
interpreted as the value for the omitted dummy variables. This
model shows that as we increase the number of variables we begin
to have a problem with losing degrees of freedom. The fifth model
is a fully adjusted model for the two dummies. The final model is
a slope-intercept adjustment model for just one dummy variable.
Note that SAS cannot estimate some of the models as specified.
This is due to perfect multicollinearity: one independent
variable is an exact linear combination of the others. You should
be able to determine why this is happening.
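One way to check a specification for perfect multicollinearity is to compare the rank of its design matrix with the number of columns. The sketch below is a minimal illustration, assuming a six-row toy data set invented here and using only the group dummies and EDUC. The row-reduction routine is a generic rank computation, not a description of what SAS does internally.

```python
from fractions import Fraction


def column_rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination in exact arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Toy observations (WORK code, EDUC), invented for illustration.
workers = [("P", 16), ("S", 12), ("N", 10), ("U", 9), ("P", 18), ("S", 11)]
GROUPS = {"PROF": "P", "UNEM": "U", "UNSKL": "N", "SKIL": "S"}


def design(dummies, intercept):
    """Build design-matrix rows: optional column of ones, the dummies, then EDUC."""
    rows = []
    for work, educ in workers:
        row = [1] if intercept else []
        row += [1 if work == GROUPS[d] else 0 for d in dummies]
        row.append(educ)
        rows.append(row)
    return rows


all_four = ["PROF", "UNEM", "UNSKL", "SKIL"]
specs = {
    "intercept + four dummies": design(all_four, intercept=True),
    "four dummies, no intercept": design(all_four, intercept=False),
    "intercept + three dummies": design(all_four[:3], intercept=True),
}
for label, rows in specs.items():
    print(label, "rank", column_rank(rows), "of", len(rows[0]), "columns")
```

A rank below the column count means at least one column is an exact linear combination of the others, so X'X has no inverse for that specification.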
INSTRUCTIONS FOR WRITE-UP
Explain each step of the program both intuitively and in
econometric notation, and give an intuitive explanation of what
the problem is trying to accomplish. Explain how using NOINT
allows you to avoid having to select a default group. State what
the default group is for each model that does not use the NOINT
option. Write down each of the estimated models and analyze them.
Give an intuitive explanation of each model. Analyze them in
terms of t and F-statistics, R-square, etc. Label all output.
Show why SAS couldn't run some of the models as specified and had
to kick one or more of the dummy variables out of the model, thus
making those dummies the default group. Did these models fall
into the infamous "Dummy Variable Trap"? Does a person's sex
and/or profession affect his or her wage? In general, what are
the properties of the GLM estimators?