Richard Williams, Notre Dame Sociology

Sociology 73994

Categorical Data Analysis

Richard Williams, Instructor

Spring 2018

NOTE:  These are the Spring 2018 course notes for my categorical data graduate statistics course.  These pages will be updated whenever I complete another session of the course, and possibly sooner.  Notes from the current semester when the course is being taught can be found here.

The Stata Highlights page includes links to Stata and statistical handouts from my other courses that may interest readers.

Stata is in the labs.  You can also order your own personal copy of Stata through the GradPlan package.  Stata 15 is now out and I encourage you to get it. Stata 14, 13 or 12 is ok for most purposes but versions of Stata older than that lack some features we will definitely want to use.


NOTE: The following special types of files are used on this web page. Some materials are available only to users.

PDF  PDF files. Require Adobe Acrobat Reader.

  Stata files.

Useful sites for learning about Stata and SPSS

Rich Williams' Stata Highlights Page

UCLA's Statistical Computing Resources 
RW Suggestions for Using Stata at Notre Dame 

UCLA's Stata Resources

RW's Suggested downloads

UCLA - How does Stata compare with SAS and SPSS?
Resources for learning Stata
Wisconsin SSCC Articles on Statistical Computing
The Stata User Support Page
Ben Jann's estout/esttab support page (esttab & estout are great for formatting output from Stata)

Overview.  This course discusses methods and models for the analysis of categorical dependent variables and their applications in social science research. Researchers are often interested in the determinants of categorical outcomes. For example, such outcomes might be binary (lives/dies), ordinal (very likely/somewhat likely/not likely), nominal (taking the bus, car, or train to work) or count (the number of times something has happened, such as the number of articles written). When dependent variables are categorical rather than continuous, conventional OLS regression techniques are not appropriate. This course therefore discusses the wide array of methods that are available for examining categorical outcomes.


Regression Models for Categorical Dependent Variables Using Stata (required text; as far as I know it is cheapest to order directly from the Stata Bookstore)

Fixed Effects Regression Models, by Paul Allison. (required text; order it from Amazon or elsewhere)

Stata Files Stata data files used in the course will usually be stored in this folder.

Recommended Reading (ND.Edu Netid is required for access)

Cloud Storage. I strongly urge you to use something like Dropbox, Google Drive, Microsoft OneDrive, Box, or a similar service to back up your most critical files. With so many free and inexpensive services now available, there is really no excuse for losing more than a day's work, if that. I am very fond of Dropbox but just about anything will do. For more options see Notre Dame Collaboration and File Storage.

Brief Review of Models for Continuous Outcomes (Review this on your own and come to me with questions as needed)

PDF  Review of Multiple Regression

reg01.dta - Data file used in the Stata Regression handout

PDF  Using Stata for OLS Regression  (If you are interested, click here for a similar handout using SPSS)

I. Foundations of categorical data analysis.

This section will go over the basics of logistic regression. It will also go over techniques for making results more interpretable; analyzing data sets with complex sampling schemes; and (possibly) techniques for handling missing data. I call these topics "foundations" because once you understand them it is very easy to extend them to other CDA methods.

Overview of Generalized Linear Models, Maximum Likelihood Estimation

    Introduction to Generalized Linear Models

   - Stata program for GLM handout

    Maximum Likelihood Estimation (May or may not cover in class, but be sure to read it)

   - Stata program for MLE handout

    Assignment 1: Preliminary Data Analysis & Setup. Due January 25, 2018.

Models for Binomial Outcomes: Basics of Logistic Regression

Logistic Regression I: Problems with the Linear Probability Model (LPM) - Stata file(s) used in the logistic regression 1 handout

Logistic Regression II: The Logistic Regression Model (LRM) - Stata file(s) used in the logistic regression 2 handout

Logistic Regression III: Hypothesis Testing, Comparisons with OLS - Stata file(s) used in the logistic regression 3 handout

Measures of fit - Pseudo R^2, BIC, AIC (Read on your own; I may just discuss a few highlights)

   - Stata program for measures of fit

Using Stata for Logistic Regression (be sure to read this on your own, as it covers important details we may not go over in class) - Stata file(s) used in the using stata for logistic regression handout

 logist.dta - Stata data file used in the Logistic Regression handouts
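As a rough companion to the measures of fit listed above, here is a minimal Python sketch of how McFadden's pseudo R^2, AIC, and BIC can be computed from two log-likelihoods. The function name and the numbers are hypothetical and chosen only for illustration. The formulas are one common set of definitions; other variants of BIC and pseudo R^2 exist, so check which one your software reports.

```python
import math


def fit_measures(ll_null, ll_model, n_params, n_obs):
    """Return McFadden's pseudo R^2, AIC, and BIC for a fitted model.

    ll_null  : log-likelihood of the intercept-only model
    ll_model : log-likelihood of the fitted model
    n_params : number of estimated parameters, including the intercept
    n_obs    : number of observations
    """
    # McFadden's pseudo R^2: proportional improvement in log-likelihood
    mcfadden_r2 = 1.0 - ll_model / ll_null
    # AIC penalizes each parameter by 2
    aic = -2.0 * ll_model + 2.0 * n_params
    # BIC penalizes each parameter by ln(N)
    bic = -2.0 * ll_model + n_params * math.log(n_obs)
    return {"mcfadden_r2": mcfadden_r2, "aic": aic, "bic": bic}


# Hypothetical log-likelihoods, not taken from any course data set
result = fit_measures(ll_null=-100.0, ll_model=-80.0, n_params=4, n_obs=200)
print(result)
```

Smaller AIC and BIC values indicate the preferred model when comparing models fit to the same data.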

(Optional) Paul Allison further elaborates on the merits of the logit model. However, Paul von Hippel maintains the LPM often isn't so bad. Allison responds. Marc Bellemare, Christopher Zorn, Dave Giles, and Marc Bellemare (again) further debate the issue. My own feeling is that the best of both worlds is to use logit and then couple it with marginal effects.

    Assignment 2: Basics of Logistic Regression. Due February 1, 2018.

   - If you are curious, this shows how the fake data were created

   - do file used for the results presented in HW # 2

            cruz.dta - data file used in HW # 2

Interpreting results: Adjusted Predictions and Marginal effects

The results from binomial and ordinal models can often be difficult to interpret. All too often, researchers discuss the sign and statistical significance of results but say little about their substantive significance. I will expect every student paper to use the methods described in this section and/or one of the advanced methods we discuss later in the course. I will also expect students to be familiar with some of Long & Freese's spost13 commands, such as mtable and mchange. I will probably go over the first handout fairly carefully and then skim the rest, since they are mostly examples you can go over on your own.

    Using Stata's Margins Command to Estimate and Interpret Adjusted Predictions and Marginal Effects (click here for the Powerpoint version).

   - Stata program for margins #1 handout

Also - the Stata Journal article I wrote on this is available for free. For an application of the margins command, see my 2013 article with Lutz Bornmann entitled How to calculate the practical significance of citation impact differences? Also available here. The do file and data file required to replicate the paper's analysis are here.

   Marginal Effects for Continuous Variables. We have mostly been talking about adjusted predictions and marginal effects for categorical variables. This handout explains how marginal effects for continuous variables are estimated and interpreted, and includes additional technical details for those who want them.

   - Stata program for margins #2 handout

   Understanding & Interpreting the Effects of Continuous Variables: The MCP Command.

   - Stata program for margins #3 handout

   Using the spost13 commands for adjusted predictions and marginal effects with binary dependent variables. This is pretty easy so I may just have you read this on your own.

   - Stata program for margins #4 handout

   Adjusted predictions and marginal effects for multiple outcome commands and models. Commands like ologit, oprobit, mlogit, oglm, and gologit2 estimate models where there are multiple outcomes, e.g. the dependent variable is ordinal and has 5 categories. The same margins and spost13 commands can generally be used with each. This handout explains how.

   - Stata program for margins #5 handout

    Assignment 3: Interpreting results: Adjusted Predictions and Marginal effects. Due February 8, 2018.

   - do file used for the results presented in HW # 3

            cruz.dta - data file used in HW # 3
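To make the idea of adjusted predictions and marginal effects in this section concrete, here is a minimal Python sketch for a logit model with one continuous predictor. The coefficients B0 and B1 are hypothetical, not estimates from any course data set. The sketch evaluates the predicted probability and the marginal effect at a few chosen values of x, which is roughly the kind of quantity that a command like margins reports at representative values.

```python
import math

# Hypothetical logit coefficients (intercept and slope), for illustration only
B0, B1 = -2.0, 0.5


def logistic(z):
    """Logistic CDF: converts a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_prob(x):
    """Adjusted prediction: P(y = 1) at a given value of x."""
    return logistic(B0 + B1 * x)


def marginal_effect(x):
    """Instantaneous rate of change of P(y = 1) with respect to x.

    For the logit model this is B1 * p * (1 - p), so it depends on
    where along the curve it is evaluated.
    """
    p = predicted_prob(x)
    return B1 * p * (1.0 - p)


def numeric_derivative(x, h=1e-6):
    """Central-difference check on the analytic marginal effect."""
    return (predicted_prob(x + h) - predicted_prob(x - h)) / (2.0 * h)


for x in (0.0, 4.0, 8.0):
    print(x, round(predicted_prob(x), 4), round(marginal_effect(x), 4))
```

The marginal effect is largest where the predicted probability is near .5 and shrinks toward the tails, which is why a single logit coefficient says little about substantive significance on its own.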

Categorical Data Analysis with Complex Survey Designs

By default, most statistical techniques assume that data were collected via simple random sampling. This is often not true for large national data sets. Fortunately, Stata makes it easy to analyze such data, but there are some important differences in how you go about testing hypotheses and assessing model fit.

  Analyzing Complex Survey Data: Some key issues to be aware of. I've consolidated what had been a few separate handouts. Key issues for both OLS regression and categorical data analysis are discussed.

   - Stata program for svy analysis handout

        Also worth seeing:    UCLA's (see lower third of page) and StataCorp's FAQs on Survey Data Analysis

Missing data

I am mostly covering this here because it is an important topic and there wasn't enough time to cover it in the new Stats I! But, several of the methods do involve the use of categorical data analysis, so it isn't totally out of place. This will be review for those of you who have had me before.

Missing Data Part 1: Overview, Traditional Methods. Read on your own. It mostly explains why most traditional methods for handling missing data (other than listwise deletion) are seriously flawed. I won't talk about this in class. md.dta - Stata files used in the Missing Data Part 1 handout & in the homework

Missing Data Part 2: Multiple Imputation & Maximum Likelihood - Stata file for the MD Part 2 handout

Also worth seeing: The Wisconsin Social Science Computing Cooperative has some great pages on MI. Also possibly helpful are the StataCorp FAQs on MI and this page from UCLA.

    Assignment 4: Complex Survey Designs; Multiple Imputation. Due February 20, 2018 (note the change of date)

   - do file used for the results presented in HW # 4

II. Intermediate CDA Methods

Here we will talk about other commonly used CDA methods, including ordinal regression, models for multinomial outcomes, and models for count outcomes.

Models for Ordinal Outcomes I: The ordered logit and interval regression models

    Ordinal Logit Models: Basic & Intermediate Topics

   - Stata program for ologit overview

(Optional) It doesn't have much to do with statistics, but here is the true story of the engineer who tried to save Challenger.

Interval Regression

   - Stata program for interval regression

Models for Multinomial Outcomes

When categories are unordered, multinomial logistic regression is one often-used strategy. We will discuss several ways to aid in the interpretation and testing of these models.

    Multinomial Logit - Overview

   - Stata program for mlogit, including adjusted predictions & marginal effects

    Other Post-Estimation Commands for mlogit

   - Stata program for other mlogit post-estimation commands

    Assignment 5: Ordinal and Multinomial Models. Due March 1, 2018.

   - do file used in HW # 5

Models for Count Outcomes

Variables that count the number of times something happens are common in the social sciences. For example, Long examined the number of publications by scientists. Count variables are often treated as though they are continuous and the linear regression model is applied, but this can result in inefficient, inconsistent and biased estimates. In this section we will examine some of the many models that deal explicitly with count outcomes.

    Count Models

   - Stata program for count models

Assignment 6: Paper Proposal - Due March 8, 2018. Send copies to both the instructor and the TA.

Assignment 7: Count Models - Due March 22, 2018

   - do file used in HW # 7
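As a small illustration of what a model for count outcomes looks like, here is a minimal Python sketch of the Poisson regression setup, the simplest of the count models. The coefficients are hypothetical and are not estimates from Long's publication data or any other data set. The sketch shows the log link (the expected count is exp(xb)) and the resulting probabilities of observing 0, 1, 2, ... events.

```python
import math

# Hypothetical Poisson regression coefficients, for illustration only
B0, B1 = 0.2, 0.3


def expected_count(x):
    """Expected count under the log link: mu = exp(B0 + B1 * x)."""
    return math.exp(B0 + B1 * x)


def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


mu = expected_count(1.0)

# Probabilities of 0 through 4 events at x = 1
probs = [poisson_pmf(k, mu) for k in range(5)]

# Factor change in the expected count for a one-unit increase in x
factor_change = math.exp(B1)

print(round(mu, 4), [round(p, 4) for p in probs], round(factor_change, 4))
```

Because of the log link, a one-unit increase in x multiplies the expected count by exp(B1) regardless of the starting value of x. The Poisson model also forces the variance to equal the mean, which is one reason alternatives such as the negative binomial model are often considered.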

III. Advanced Topics (Subject to Change or Re-Ordering)

Models for Binary Outcomes II: Intermediate Logistic Regression

    The Latent Variable Model for Binary Regression

   - Stata program for Latent variable handout

    Standardized Coefficients in Logistic Regression

   - Stata program for standardized coefficients

    Alternatives to logistic regression

Models for Binary Outcomes III: Comparing logit and probit coefficients across nested models

    Prelude to Comparing Logit & Probit Coefficients Between Nested Models

   - Stata program for Prelude to comparing coefficients across nested models

    Comparing Logit & Probit Coefficients Between Models  (click here for Powerpoint version)

    Handout for Comparing Logit & Probit Coefficients Between Models 

    Comparing Logit & Probit Coefficients Between Nested Models (Extended Version). OPTIONAL.  This is actually an older version of the handout but it includes several additional points that might be helpful.

Assignment 8: Intermediate issues in logistic regression analysis. Due April 5, 2018

   - do file used in HW # 8


Panel Data/ Multilevel Models

Sometimes the same individuals (or nations, or companies) are measured at multiple points in time. The statistical technique used needs to reflect the fact that the different measurements are not independent of each other. This is a big topic and goes well beyond Categorical Data Analysis, but a few basic commands, e.g. xtlogit, will be discussed.

  Panel Data 0: Brief Overview of Linear Models. Read this on your own. It primarily focuses on techniques for panel data with continuous variables. Some key points are repeated in the handouts below.

   - Stata program for brief overview of panel data linear models

  Panel Data 1: Discrete-Time Methods for the Analysis of Event Histories (Read this on your own too. I will backtrack to it if time permits but I think the other topics are more important.) Often we are interested not only in whether an event occurs, but how quickly it happens. Drawing on work from Allison, this handout shows how panel data and basic logistic regression techniques can sometimes be used for such purposes.

  Panel Data 2: Setting up Panel data

   - Stata program for setting up panel data

  Panel Data 3: Conditional Logit/ Fixed Effects Logit Models

   - Stata program for clogit/ fixed effects

  Panel Data 4: Fixed Effects vs Random Effects Models

   - Stata program for clogit/ fixed effects

  Multilevel/ Mixed Effects Models: A Brief Overview

   - Stata program for clogit/ fixed effects

Also recommended: (NOTE: xtmelogit has been superseded by melogit)

Assignment 9: Panel Data Methods. Due April 12, 2018

   - do file used in HW # 9
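The discrete-time event history approach mentioned under Panel Data 1 relies on restructuring the data so that each person contributes one record per time period at risk. Here is a minimal Python sketch of that restructuring step. The function name, the field names, and the two example records are hypothetical; this is not the code used in the handout.

```python
def to_person_period(records):
    """Expand one-row-per-person durations into one row per person per period.

    Each input record is (person_id, last_period, event), where last_period
    is the last period in which the person was observed and event is 1 if
    the event occurred in that period, or 0 if the person was censored.
    """
    rows = []
    for person_id, last_period, event in records:
        for period in range(1, last_period + 1):
            # The outcome is 1 only in the final period, and only if the
            # event actually occurred; all earlier periods are coded 0.
            y = 1 if (event == 1 and period == last_period) else 0
            rows.append({"id": person_id, "period": period, "event": y})
    return rows


# Hypothetical data: person 1 has the event in period 3,
# person 2 is censored after period 2
people = [(1, 3, 1), (2, 2, 0)]
rows = to_person_period(people)
for row in rows:
    print(row)
```

Once the data are in this layout, an ordinary logistic regression of the event indicator on period and other covariates can be used to model how the chance of the event changes over time.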


Models for Ordinal Outcomes II: Generalized ordered logit models

The assumptions of the ordered logit model are often violated. The generalized ordered logit model (estimated by gologit2) sometimes provides a viable but still parsimonious alternative.

    GOLOGIT Part 1: The gologit model & gologit2 program (Powerpoint version)

    GOLOGIT Part 2: Interpretation of results. (Powerpoint version) - Also get this handout

    Updates to gologit2: This describes major updates to the program since it was released in 2006.

For more detail, you should read the 2006 Stata Journal article that introduced the program and the 2016 Journal of Mathematical Sociology article on how to interpret results. The JMS reading requires an account to access; others can find it described at

Models for Ordinal Outcomes III: Heterogeneous Choice Models and Other Methods for Comparing Logit & Probit Coefficients Across Groups

To the surprise of many, techniques used for group comparisons in OLS regression (e.g. adding interaction effects) can be highly problematic in logistic and ordinal regression. As Hoetker notes, "in the presence of even fairly small differences in residual variation, naive comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists." We will discuss how heterogeneous choice models and possibly other methods offer possible solutions. The handout covers only the most critical points; for those who want to know more, the article by Allison and the two articles by Williams that are mentioned in the references are highly recommended.

    Comparing Logit and Probit Coefficients Across Groups: Problems, Solutions, and Problems with the Solutions (also available in Powerpoint).

    Handout for Comparing Logit & Probit Coefficients Across Groups 

Assignment 10: Advanced issues/ models for ordinal outcomes. Due April 19, 2018

   - do file used in HW # 10
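One way to see why differences in residual variation matter for group comparisons is through the latent variable formulation of the logit model. The notation below is an assumption made for this sketch, not necessarily the notation used in the handout: g indexes groups, sigma_g scales the residual in group g, and epsilon follows a standard logistic distribution.

```latex
\[
\begin{aligned}
y_i^{*} &= \alpha_g + \beta_g x_i + \sigma_g \varepsilon_i,
  \qquad y_i = 1 \text{ if } y_i^{*} > 0, \quad y_i = 0 \text{ otherwise} \\
\Pr(y_i = 1 \mid x_i) &= \Lambda\!\left( \frac{\alpha_g}{\sigma_g} + \frac{\beta_g}{\sigma_g}\, x_i \right),
  \qquad \Lambda(z) = \frac{e^{z}}{1 + e^{z}}
\end{aligned}
\]
```

A logit model fit separately to each group therefore recovers beta_g / sigma_g rather than beta_g itself. For example, if beta is 1 in both groups but sigma is 1 in the first group and 2 in the second, the estimated coefficients would be about 1 and 0.5, suggesting a group difference in the effect of x where none exists.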


Models for Binary/Proportional Outcomes IV: Special Topics

    Analyzing Rare Events with Logistic Regression. Many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare, e.g. only 20 or 30 people experience the event. This handout describes the problem and discusses various solutions, with an emphasis on Penalized Maximum Likelihood (aka the Firth Method).

   - Stata program for rare events models

    Analyzing Proportions / Fractional Response Models. In many cases, the dependent variable of interest is a proportion, i.e. its values range between 0 and 1. Wooldridge (1996, 2011) gives the example of the proportion of employees that participate in a company's pension plan. This handout shows that methods used for binary outcomes can easily be adapted to deal with such variables. Other approaches are also discussed.

   - Stata program for models for analyzing proportions

Miscellaneous Topics

    Ordinal Independent Variables. We often want to use ordinal variables as independent/explanatory variables in our models. Rightly or wrongly, it is very common to treat such variables as continuous. This handout discusses when it is appropriate to do so. The handout also discusses other possible strategies that can be employed with ordinal independent variables, such as the use of Sheaf coefficients.

   - do file used in the Ordinal Independent Variables handout