PROBLEM 17: DUMMY VARIABLE PITFALLS

OBJECT OF THE PROBLEM

To familiarize you with some of the difficulties in using dummy
variables, including perfect multicollinearity and the so-called
"dummy variable trap".

INTRODUCTION

When using dummy variables, one must be certain to apply them
correctly. In particular, one must be careful not to include a set
of dummy variables that sum to one for every observation while the
intercept term (a column of ones) is also in the model. This leads
to perfect multicollinearity and prevents the X'X matrix from being
inverted. The problem arises whenever a full set of dummies, one
for every category, is entered alongside the intercept, and it is
easy to overlook when the variable has more than two categories.
One way to avoid it is to run the model without the intercept term.
Alternatively, one of the dummy variables causing the problem can
be left out. The intercept then measures the level for the omitted
category, and each remaining dummy coefficient measures the
difference from that category.
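In matrix notation the trap can be sketched as follows. The symbols
used here (m categories with dummies D_1, ..., D_m, a column of ones
written as iota, the regressor matrix X, and the vector c) are
notation assumed for this illustration; they do not appear in the
program.

```latex
% Every observation i belongs to exactly one of the m categories:
\[
  D_{1i} + D_{2i} + \cdots + D_{mi} = 1 \quad \text{for all } i
  \qquad\Longleftrightarrow\qquad
  \iota = D_1 + D_2 + \cdots + D_m .
\]
% With X = [ \iota, D_1, ..., D_m, other regressors ], the nonzero
% vector c = (1, -1, ..., -1, 0, ..., 0)' annihilates X:
\[
  Xc = \iota - D_1 - \cdots - D_m = 0
  \;\Longrightarrow\; (X'X)c = 0
  \;\Longrightarrow\; \det(X'X) = 0 .
\]
% Hence the least squares formula cannot be evaluated:
\[
  \hat{\beta} = (X'X)^{-1}X'y \quad \text{does not exist.}
\]
```

Dropping the column of ones, or dropping any one D_j, removes the
dependence, which is why either remedy restores an invertible
matrix.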

Besides the simple dummy variable, which implies only a shift in
the intercept term, one can also adjust the slopes by multiplying
the dummy by the continuous independent variables in the model.
Including both the dummy and these products allows for both slope
and intercept changes. Alternatively, we could allow only slope
changes by omitting the 0-1 coded variable itself and keeping the
products.
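The three specifications can be written out for a single dummy and
a single continuous regressor. The symbols y, x, D, u and the
coefficient names are assumptions made for this sketch, not names
taken from the program.

```latex
% Intercept adjustment only: parallel lines for the two groups
\[ y_i = \beta_0 + \delta_0 D_i + \beta_1 x_i + u_i \]
% Slope adjustment only: common intercept, different slopes
\[ y_i = \beta_0 + \beta_1 x_i + \delta_1 (D_i x_i) + u_i \]
% Slope-intercept adjustment: both may differ
\[ y_i = \beta_0 + \delta_0 D_i + \beta_1 x_i + \delta_1 (D_i x_i) + u_i \]
% In the last model the two groups have
\[
  D_i = 0:\; \text{intercept } \beta_0,\ \text{slope } \beta_1
  \qquad
  D_i = 1:\; \text{intercept } \beta_0 + \delta_0,\
             \text{slope } \beta_1 + \delta_1
\]
```

Here delta_0 and delta_1 measure how far the D = 1 group departs
from the D = 0 group, so a test that both are zero is a test that
the two groups share one regression line.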

In this exercise you will use several different dummy variable
techniques: the slope adjustment model, the intercept adjustment
model, and the slope-intercept adjustment model.

The exercise uses the job-worker characteristics data, which are
stored on disk. The groups PROF, SKIL, UNSKL and UNEM are defined
to be mutually exclusive and exhaustive. In other words,
PROF + SKIL + UNSKL + UNEM = 1 for every observation. The data
should be familiar to you. The models try to explain wage using
the continuous variables education, experience, and age, and the
dummy variables for sex and profession.
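One way to confirm the adding-up property in the data is sketched
below. It assumes the DATA WOLFSON step from the problem program
has already run in the same SAS job; the data set name CHECK and
the variable DSUM are hypothetical names chosen for this
illustration.

```sas
* Sketch only: append after the DATA WOLFSON step of the program;
DATA CHECK; SET WOLFSON;
  * DSUM should equal 1 on every record if the groups are
    mutually exclusive and exhaustive;
  DSUM = PROF + SKIL + UNSKL + UNEM;
PROC MEANS DATA=CHECK N MIN MAX;
  VAR DSUM;
RUN;
```

If both the minimum and the maximum of DSUM are 1, every worker
falls in exactly one of the four groups.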

PROBLEM PROGRAM

The following program can be used in the problem:

//PROB17 JOB (AF,A592),yourname,REGION=4096K
/*OPENBIN
//S1 EXEC SAS
//CRAYPO DD DSN=AU5920.ECON592.DATA(WORKER),DISP=SHR
//SYSIN DD *
DATA WOLFSON; INFILE CRAYPO;
INPUT WORK $ 1 EDUC 2-3 EXPER 4-5 AGE 6-7 SEX $ 8 WAGE 14-17;
* consider the following four groups to be mutually exclusive and
  exhaustive;
PROF=0; IF WORK='P' THEN PROF=1;
UNSKL=0; IF WORK='N' THEN UNSKL=1;
UNEM=0; IF WORK='U' THEN UNEM=1;
SKIL=0; IF WORK='S' THEN SKIL=1;
MALE=0; IF SEX='M' THEN MALE=1;
FEMALE=0; IF SEX='F' THEN FEMALE=1;
* Model 1: intercept with all four occupation dummies;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL SKIL EDUC EXPER AGE;
* Model 2: all four occupation dummies, no intercept;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL SKIL EDUC EXPER AGE/NOINT;
* Model 3: intercept with SKIL omitted;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL EDUC EXPER AGE;
* Model 4: occupation and sex dummies;
PROC GLM;
MODEL WAGE=PROF UNEM UNSKL MALE EXPER EDUC AGE;
* Model 5: occupation dummies and their products with MALE;
PROC GLM;
MODEL WAGE=PROF*MALE UNEM*MALE UNSKL*MALE SKIL*MALE PROF
UNEM UNSKL SKIL EXPER EDUC AGE/NOINT;
* Model 6: MALE and its products with the continuous variables;
PROC GLM;
MODEL WAGE=EXPER EDUC MALE MALE*EXPER MALE*EDUC MALE*AGE
AGE;
//

DISCUSSION OF THE PROGRAM

The first three models are examples of various intercept adjustment
models. The fourth model has dummy variables for two different
effects, sex and profession, as opposed to just profession in the
first three. Note that the intercept is interpreted as the value
for the omitted group. This model shows that as we increase the
number of variables we begin to lose degrees of freedom. The fifth
model is a fully adjusted model for the two sets of dummies. The
final model is a slope-intercept adjustment model for just one
dummy variable.
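The loss of degrees of freedom can be made concrete by counting
parameters. The sketch below assumes n observations (n is not
stated in this handout) and counts k as the number of coefficients
requested in each MODEL statement, including the intercept unless
NOINT is given and ignoring any coefficient SAS may have to drop.

```latex
\[ \text{residual d.f.} = n - k \]
\[
\begin{array}{lll}
\text{Model 3:} & k = 1 + 3 + 3 = 7
  & \text{intercept, 3 occupation dummies, 3 continuous} \\
\text{Model 4:} & k = 1 + 3 + 1 + 3 = 8
  & \text{adds MALE} \\
\text{Model 5:} & k = 4 + 4 + 3 = 11
  & \text{4 products, 4 occupation dummies, 3 continuous} \\
\text{Model 6:} & k = 1 + 1 + 3 + 3 = 8
  & \text{intercept, MALE, 3 continuous, 3 products}
\end{array}
\]
```

Each added dummy or product term therefore costs one degree of
freedom, which matters most when n is small.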

Note that some of the models cannot be run by SAS as written. This
is due to perfect multicollinearity: one independent variable is an
exact linear combination of others. You should be able to determine
why this is happening.
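One diagnostic, offered here as a sketch and not as part of the
assigned program, is an auxiliary regression: regress a suspect
regressor on the other regressors of the same model. An R-square of
exactly one means that regressor is an exact linear combination of
the rest. The example uses the regressors of the first MODEL
statement and assumes the WOLFSON data set is available in the same
job.

```sas
* Sketch only: auxiliary regression for the first model.
  PROC REG includes an intercept by default;
PROC REG DATA=WOLFSON;
  MODEL PROF = UNEM UNSKL SKIL EDUC EXPER AGE;
RUN;
```

The same check can be repeated with any other regressor on the
left-hand side to look for further dependencies.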

INSTRUCTIONS FOR WRITE-UP

Give both an intuitive explanation and an explanation in
econometric notation of each step of the program, and an intuitive
explanation of what the problem is trying to accomplish. Explain
how using NOINT allows you to avoid having to select a default
group. State what the default group is for each model that does not
use the NOINT option. Write down each of the estimated models and
analyze them. Give an intuitive explanation of each model. Analyze
them in terms of t- and F-statistics, R-square, etc. Label all
output. Show why SAS could not run some of the models as specified
and had to drop one or more of the dummy variables, thereby making
those dummies the default group. Did these models fall into the
infamous "Dummy Variable Trap"? Does a person's sex and/or
profession affect his or her wage? In general, what are the
properties of the GLM estimators?