Richard Williams, Notre Dame Sociology

Sociology 73994

Categorical Data Analysis

Richard Williams, Instructor

Spring 2021 [Course Completed]

 


NOTE:  These are the Spring 2021 course notes for my categorical data graduate statistics course.  These pages will be updated whenever I complete another session of the course, and possibly sooner.  Notes from the current semester when the course is being taught can be found here.
 

 


NOTE:  My Stata Highlights page includes links to Stata and statistical handouts from my other courses that may interest readers.

This page is under development.  Links will become "live" once the handout is ready.  Click here if you want to see the complete online notes and handouts from the last time the course was taught.  I will be making some changes this year, but the old notes should be fine for anyone who wants to get a head start on methods we haven't gotten to yet.

Stata is in the labs.  You can also order your own personal copy of Stata through the GradPlan package.  Stata 16 is now out and I encourage you to get it if you want to work off campus. Alternatively, you can check out How to Use Stata on the Notre Dame Center for Research Computing (CRC) Machines. These use Unix and can be accessed remotely over the web. I'd much rather have you use the CRC machines than some hand-me-down old version of Stata that has bugs that were fixed years ago or that lacks important new features.

 

NOTE: The following special types of files are used on this web page. Some materials are available only to nd.edu users.

PDF  Pdf files. Require Adobe Acrobat.  Get Acrobat Reader

  Stata files.

Useful sites for learning about Stata and SPSS

Rich Williams' Stata Highlights Page

UCLA's Statistical Computing Resources

RW Suggestions for Using Stata at Notre Dame

UCLA's Stata Resources

RW's Suggested downloads

UCLA - How does Stata compare with SAS and SPSS?

Resources for learning Stata

Wisconsin SSCC Articles on Statistical Computing

The Stata User Support Page

Ben Jann's estout/esttab support page (esttab & estout are great for formatting output from Stata)

Overview.  This course discusses methods and models for the analysis of categorical dependent variables and their applications in social science research. Researchers are often interested in the determinants of categorical outcomes. For example, such outcomes might be binary (lives/dies), ordinal (very likely/somewhat likely/not likely), nominal (taking the bus, car, or train to work) or count (the number of times something has happened, such as the number of articles written). When dependent variables are categorical rather than continuous, conventional OLS regression techniques are not appropriate. This course therefore discusses the wide array of methods that are available for examining categorical outcomes.
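As a rough orientation, the four outcome types listed above map onto four standard Stata estimation commands. The sketch below is only an illustration and is not taken from the course handouts: the data sets are Stata's own example files (sysuse/webuse, so an internet connection is assumed for webuse), and the predictors were picked arbitrarily.

```stata
* Illustrative sketch only; data sets and predictors are assumptions,
* chosen from Stata's shipped/web example files, not from the course.

* Binary outcome (foreign vs. domestic car): logit
sysuse auto, clear
logit foreign mpg weight

* Ordinal outcome (repair record, poor to excellent): ologit
webuse fullauto, clear
ologit rep77 foreign length mpg

* Nominal outcome (type of health insurance): mlogit
webuse sysdsn1, clear
mlogit insure age male nonwhite

* Count outcome (number of deaths, with person-years of exposure): poisson
webuse dollhill3, clear
poisson deaths smokes i.agecat, exposure(pyears)
```

Each of these commands is taken up in its own section of the course, along with alternatives for cases where its assumptions do not hold.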

PDF  Syllabus

Regression Models for Categorical Dependent Variables Using Stata (required text; as far as I know it is cheapest to order directly from the Stata Bookstore)

Fixed Effects Regression Models, by Paul Allison. (required text; order it from Amazon or elsewhere). Notre Dame users can also access an e-version from our library.

Stata Files Stata data files used in the course will usually be stored in this folder.

Recommended Reading (ND.Edu Netid is required for access)

Cloud Storage. I strongly urge you to use something like Dropbox, Google Drive, Microsoft Onedrive, Box, or a similar service to back up your most critical files. With so many free and inexpensive services now available, there is really no excuse for losing more than a day's work, if that. I am very fond of Dropbox but just about anything will do. For more options see Notre Dame Collaboration and File Storage.

Brief Review of Models for Continuous Outcomes (Review this on your own and come to me with questions as needed)

PDF  Review of Multiple Regression

reg01.dta - Data file used in the Stata Regression handout

PDF  Using Stata for OLS Regression  (If you are interested, click here for a similar handout using SPSS)

I. Foundations of categorical data analysis.

This section will go over the basics of logistic regression. It will also go over techniques for making results more interpretable; analyzing data sets with complex sampling schemes; and (possibly) techniques for handling missing data. I call these topics "foundations" because once you understand them it is very easy to extend them to other CDA methods.

Overview of Generalized Linear Models, Maximum Likelihood Estimation

    Introduction to Generalized Linear Models

            L01.do - Stata program for GLM handout

    Maximum Likelihood Estimation & Troubleshooting

            L02.do - Stata program for MLE handout

    Assignment 1: Preliminary Data Analysis & Setup. Due February 11, 2021 .

Models for Binomial Outcomes: Basics of Logistic Regression

Logistic Regression I: Problems with the Linear Probability Model (LPM)

 logit01.do - Stata file(s) used in the logistic regression 1 handout

Logistic Regression II: The Logistic Regression Model (LRM)

 logit02.do - Stata file(s) used in the logistic regression 2 handout

Logistic Regression III: Hypothesis Testing, Comparisons with OLS

 logit03.do - Stata file(s) used in the logistic regression 3 handout

Measures of fit - Pseudo R^2, BIC, AIC (Read on your own; I may just discuss a few highlights)

            L05.do - Stata program for measures of fit

Using Stata for Logistic Regression (be sure to read this on your own, as it covers important details we may not go over in class)

 logistic-stata.do - Stata file(s) used in the using stata for logistic regression handout

 logist.dta - Stata data file used in the Logistic Regression handouts

(Optional) Paul Allison further elaborates on the merits of the logit model. However, Paul von Hippel maintains the LPM often isn't so bad. Allison responds back. Marc Bellemare, Christopher Zorn, Dave Giles, and Marc Bellemare (again) further debate the issue. My own feeling is that the best of both worlds is to use logit and then couple it with marginal effects.

    Assignment 2: Basics of Logistic Regression. Due February 25, 2021.

            cruzmake.do - If you are curious, this shows how the fake data were created

            Hw02.do - do file used for the results presented in HW # 2

            cruz.dta - data file used in HW # 2

 

Interpreting results: Adjusted Predictions and Marginal effects

The results from binomial and ordinal models can often be difficult to interpret. All too often, researchers discuss the sign and statistical significance of results but say little about their substantive significance. I will expect every student paper to use the methods described in this section and/or one of the advanced methods we discuss later in the course. I will also expect students to be familiar with some of Long & Freese's spost13 commands, such as mtable and mchange. I will probably go over the first handout fairly carefully and then skim the rest, since they are mostly examples you can go over on your own.

Margins01 and Margins03 will definitely be covered in class. The others you should read on your own and ask questions if you have them.
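To give a feel for what this section is after, here is a minimal sketch of adjusted predictions and average marginal effects after a logit model, using Stata's official margins command. The data set (webuse nhanes2f) and the model are assumptions made for illustration; the handouts should be consulted for the examples actually used in class. The spost13 commands mentioned above (mtable, mchange) are user-written and are not shown here.

```stata
* Illustrative sketch only; data set and model are assumptions.
webuse nhanes2f, clear
logit diabetes i.black i.female age, nolog

* Adjusted predictions: average predicted probability of diabetes
* for each level of black and of female
margins black female

* Average marginal effects of the two categorical predictors
margins, dydx(black female)

* Adjusted predictions across a range of ages, then a plot of them
margins, at(age=(20(10)70))
marginsplot
```

The point of output like this is that it is expressed in probabilities, so substantive significance can be discussed directly rather than inferred from the sign and p-value of a logit coefficient.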

    Using Stata's Margins Command to Estimate and Interpret Adjusted Predictions and Marginal Effects (click here for the Powerpoint version).

            Margins01.do - Stata program for margins #1 handout

Also - the Stata Journal article I wrote on this is available for free. For an application of the margins command, see my 2013 article with Lutz Bornmann entitled How to calculate the practical significance of citation impact differences? The do file and data file required to replicate the paper's analysis are here.

   Marginal Effects for Continuous Variables. We have mostly been talking about adjusted prediction and marginal effects for categorical variables. This handout explains how marginal effects for continuous variables are estimated and interpreted, and includes additional technical details for those who want them. Read this on your own if I don't cover it in class.

            Margins02.do - Stata program for margins #2 handout

   Understanding & Interpreting the Effects of Continuous Variables: The MCP Command. We have mostly been talking about adjusted prediction and marginal effects for categorical variables. This handout explains how marginal effects for continuous variables are estimated and interpreted, and includes additional technical details for those who want them.

            Margins03.do - Stata program for margins #3 handout

   Using the spost13 commands for adjusted predictions and marginal effects with binary dependent variables. This is pretty easy so I may just have you read this on your own.

            Margins04.do - Stata program for margins #4 handout

   Adjusted predictions and marginal effects for multiple outcome commands and models. Commands like ologit, oprobit, mlogit, oglm, and gologit2 estimate models where there are multiple outcomes, e.g. the dependent variable is ordinal and has 5 categories. The same margins and spost13 commands can generally be used with each. This handout explains how. I am putting this handout here so it is with all the other margins handouts, but you will probably want to read it after we've covered commands like ologit and mlogit.

            Margins05.do - Stata program for margins #5 handout

    Assignment 3: Interpreting results: Adjusted Predictions and Marginal Effects. Due March 4, 2021.

            Hw03.do - do file used for the results presented in HW # 3

            cruz.dta - data file used in HW # 3

Categorical Data Analysis with Complex Survey Designs

By default, most statistical techniques assume that data were collected via simple random sampling. This is often not true for large national data sets. Fortunately, Stata makes it easy to analyze such data, but there are some important differences in how you go about testing hypotheses and assessing model fit.

  Analyzing Complex Survey Data: Some key issues to be aware of. I've consolidated what had been a few separate handouts. Key issues for both OLS regression and categorical data analysis are discussed.

            svy01.do - Stata program for svy analysis handout

        Also worth seeing:    UCLA's (see lower third of page) and StataCorp's FAQs on Survey Data Analysis
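As a small illustration of the workflow described in this section, the sketch below declares a survey design with svyset and then fits a logit model with the svy: prefix. The data set and the design variables (psuid, stratid, finalwgt in webuse nhanes2f) are assumptions chosen for illustration, not necessarily what the handout uses.

```stata
* Illustrative sketch only; data set and model are assumptions.
webuse nhanes2f, clear

* Declare the design: PSUs, sampling weights, and strata
svyset psuid [pweight=finalwgt], strata(stratid)

* Prefix the estimation command with svy: to use the declared design
svy: logit diabetes i.black i.female age

* Likelihood-ratio tests are not available after svy estimation;
* joint hypotheses are tested with an adjusted Wald test instead
test 1.black 1.female
```

This is one of the differences in hypothesis testing mentioned above: commands such as lrtest that rely on a true likelihood are not appropriate with weighted, clustered data.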

Missing data

I am mostly covering this here because it is an important topic and there wasn't enough time to cover it in the new Stats I! But, several of the methods do involve the use of categorical data analysis, so it isn't totally out of place. Also, since you are analyzing your own data this semester, you are more likely to have to deal with missing data.

Missing Data Part 1: Overview, Traditional Methods. Read on your own. It mostly explains why most traditional methods for handling missing data (other than listwise deletion) are seriously flawed. I won't talk about this in class.

  mdpart1.do, md.dta - Stata files used in the Missing Data Part 1 handout & in the homework

Missing Data Part 2: Multiple Imputation & Maximum Likelihood

mdpart2.do - Stata file for the MD Part 2 handout

Also worth seeing: The Wisconsin Social Science Computing Cooperative has some great pages on MI. Also possibly helpful are the StataCorp FAQs on MI and this page from UCLA.

    Assignment 4: Complex Survey Designs; Multiple Imputation. Due March 11, 2021.

            Hw04.do - do file used for the results presented in HW # 4

 


II. Intermediate CDA Methods

Here we will talk about other commonly used CDA methods, including ordinal regression, models for multinomial outcomes, and models for count outcomes.

Models for Ordinal Outcomes I: The ordered logit and interval regression models

(Optional but highly recommended) As part of the Sage Research Methods Foundations Project (SRMF), Williams and Quiroz (2019) provide an overview of Ordinal Regression Models. Both basic and more advanced methods (e.g. interval regression, generalized ordered logit models, heterogeneous choice models) are discussed. Those with an ND.edu account can access it here. For those not at ND, if your library has purchased SRMF (and if it hasn't it should!) the entry can be found at https://methods.sagepub.com/Foundations/ordinal-regression-models.

    Ordinal Logit Models: Basic & Intermediate Topics. After reading this you will also want to look at the Margins05 handout if you haven't already.

            ologit1.do - Stata program for ologit overview

(Optional) It doesn't have much to do with statistics, but here is the true story of the engineer who tried to save Challenger.

Interval Regression

            intreg2.do - Stata program for interval regression

Ordinal Independent Variables. We often want to use ordinal variables as independent/explanatory variables in our models. Rightly or wrongly, it is very common to treat such variables as continuous. This handout discusses when it is appropriate to do so. The handout also discusses other possible strategies that can be employed with ordinal independent variables, such as the use of Sheaf coefficients. Alternatively, you can read my entry on this in Sage Research Methods Foundations.

            OrdinalIndependentVars.do - do file used in the Ordinal Independent Variables handout

 

Models for Multinomial Outcomes

When categories are unordered, multinomial logistic regression is one often-used strategy. We will discuss several ways to aid in the interpretation and testing of these models.

    Multinomial Logit - Overview

            mlogit1.do - Stata program for mlogit, including adjusted predictions & marginal effects

    Other Post-Estimation Commands for mlogit

            mlogit2.do - Stata program for other mlogit post-estimation commands

    Assignment 5: Ordinal and Multinomial Models. Due March 18, 2021.

            Hw05.do - do file used in HW # 5

    Assignment 6: Paper Proposal - Due March 25, 2021. Send copies to both the instructor and the TA.

 

Models for Count Outcomes

Variables that count the # of times something happens are common in the Social Sciences. For example, Long examined the # of publications by scientists. Count variables are often treated as though they are continuous and the linear regression model is applied; but this can result in inefficient, inconsistent and biased estimates. In this section we will examine some of the many models that deal explicitly with count outcomes.
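For a concrete starting point, the sketch below fits two common count models, Poisson and negative binomial regression, to the same data. The data set (webuse rod93) and the specification are assumptions chosen for illustration and are not the Long publication data mentioned above.

```stata
* Illustrative sketch only; data set and model are assumptions.
webuse rod93, clear

* Poisson regression for a count outcome, with an exposure term
poisson deaths i.cohort, exposure(exposure)

* Goodness-of-fit tests; a significant result suggests the
* Poisson model fits poorly (e.g. because of overdispersion)
estat gof

* Negative binomial regression relaxes the Poisson assumption that
* the conditional variance equals the conditional mean; the output
* includes a likelihood-ratio test of alpha = 0 (i.e. of Poisson)
nbreg deaths i.cohort, exposure(exposure)
```

Comparing the two sets of results is a quick way to see whether the simpler Poisson model is adequate for a given outcome.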

    Count Models

            count01.do - Stata program for count models

Assignment 7: Count Models - Due April 1, 2021.

            Hw07.do - do file used in HW # 7


 

III. Advanced Topics (Subject to Change or Re-Ordering)

 

Panel Data/ Multilevel Models

Sometimes the same individuals (or nations, or companies) are measured at multiple points in time. The statistical technique used needs to reflect the fact that the different measurements are not independent of each other. This is a big topic and goes well beyond Categorical Data Analysis, but a few basic commands, e.g. xtlogit, will be discussed.
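As a minimal sketch of what using xtlogit looks like, the example below declares a panel structure and then fits fixed effects and random effects logit models. The data set (webuse union) and the predictors are assumptions chosen for illustration rather than the examples used in the notes.

```stata
* Illustrative sketch only; data set and model are assumptions.
webuse union, clear

* Declare the panel structure: person identifier and time variable
xtset idcode year

* Fixed effects (conditional) logit: uses only within-person change,
* so time-invariant predictors and people whose outcome never
* varies drop out of the estimation
xtlogit union age grade i.not_smsa south, fe

* Random effects logit: keeps everyone, but assumes the person-level
* effects are uncorrelated with the predictors
xtlogit union age grade i.not_smsa south, re
```

Both models account for the fact that repeated measurements on the same person are not independent, which is the problem described above.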

NOTE: I AM ACTUALLY GOING TO USE SOME OF THE NOTES FOR A MINI-COURSE I TAUGHT DURING SUMMER 2018 IN TAIWAN.  THE COMPLETE SET OF NOTES FROM THAT CLASS ARE HERE.

    Introduction (The course outline lists all the topics covered in Taiwan. We are only doing the first few.)

    Setting up the data

            datasetup.do

    Fixed effects and conditional logit models

            fixedeffects.do

    Fixed effects versus random effects models

            fixedvsrandom.do

    Basic Multilevel models

            multilevel.do

Also recommended:

    http://www.statalist.org/forums/forum/general-stata-discussion/general/1341778-how-do-xtlogit-and-melogit-differ

    http://stats.idre.ucla.edu/stata/dae/mixed-effects-logistic-regression/ (NOTE: xtmelogit has been superseded by melogit)

Assignment 8: Panel Data Methods. Due April 8, 2021.

            Hw08.do - do file used in HW # 8

 

NOTE: I've covered enough to give you some basic competency with Panel and Multilevel Models, but the Taiwan page has a lot more if you want it. Hybrid models are a way of estimating both fixed and random effects in the same model (albeit with some limitations). You can do adjusted predictions and marginal effects with random effects models. This far-from-finished presentation and handout show an application of many multilevel model methods, including random slopes models. You can do panel data linear models too. Sometimes you are interested not in whether an event occurs, but in how quickly it occurs (e.g. what factors make people die more quickly?). In such cases, Discrete Time Methods for Event History Analysis can sometimes be a good way to go.

 

Models for Binary Outcomes II: Intermediate Logistic Regression

    The Latent Variable Model for Binary Regression

            L03.do - Stata program for Latent variable handout

    Standardized Coefficients in Logistic Regression

            L04.do - Stata program for standardized coefficients

    Alternatives to logistic regression

Models for Binary Outcomes III: Comparing logit and probit coefficients across nested models

    Prelude to Comparing Logit & Probit Coefficients Between Nested Models

            Nested01.do - Stata program for Prelude to comparing coefficients across nested models

    Comparing Logit & Probit Coefficients Between Models  (click here for Powerpoint version)

    Handout for Comparing Logit & Probit Coefficients Between Models 

    Comparing Logit & Probit Coefficients Between Nested Models (Extended Version). OPTIONAL.  This is actually an older version of the handout but it includes several additional points that might be helpful.

Assignment 9: Intermediate issues in logistic regression analysis. Due April 15, 2021.

            Hw09.do - do file used in HW # 9

 

Models for Ordinal Outcomes II: Generalized ordered logit models

The assumptions of the ordered logit model are often violated. The generalized ordered logit model (estimated by gologit2) sometimes provides a viable but still parsimonious alternative.
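A rough sketch of how the two models might be compared follows. It assumes gologit2 has already been installed (it is user-written, e.g. via ssc install gologit2), and the data set (webuse nhanes2f) and predictors are assumptions chosen for illustration, not the examples from the presentations below.

```stata
* Illustrative sketch only; data set and model are assumptions.
* gologit2 is user-written and must be installed first, e.g.:
*   ssc install gologit2
webuse nhanes2f, clear

* Ordered logit: imposes the parallel lines (proportional odds)
* assumption on every predictor
ologit health female black age

* Generalized ordered logit: the autofit option relaxes the
* parallel lines constraint only for those variables where the
* assumption appears to be violated
gologit2 health female black age, autofit
```

When autofit frees only a few coefficients, the resulting partial proportional odds model is the "viable but still parsimonious alternative" referred to above.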

    GOLOGIT Part 1: The gologit model & gologit2 program (Powerpoint version)

    GOLOGIT Part 2: Interpretation of results. (Powerpoint version) - Also get this handout

    Updates to gologit2: This describes major updates to the program since it was released in 2006.

For more detail, you should read the 2006 Stata Journal article that introduced the program and the 2016 Journal of Mathematical Sociology article on how to interpret results. The JMS reading requires an nd.edu account to access; others can find it described at http://www.tandfonline.com/doi/full/10.1080/0022250X.2015.1112384.

Models for Ordinal Outcomes III: Heterogeneous Choice Models and Other Methods for Comparing Logit & Probit Coefficients Across Groups

To the surprise of many, techniques used for group comparisons in OLS regression (e.g. adding interaction effects) can be highly problematic in logistic and ordinal regression. As Hoetker notes, "in the presence of even fairly small differences in residual variation, naive comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists." We will discuss how heterogeneous choice models and possibly other methods offer possible solutions. The handout covers only the most critical points; for those who want to know more, the article by Allison and the two articles by Williams that are mentioned in the references are highly recommended.

    Comparing Logit and Probit Coefficients Across Groups: Problems, Solutions, and Problems with the Solutions (also available in Powerpoint).

    Handout for Comparing Logit & Probit Coefficients Across Groups 

Assignment 10: Advanced issues/ models for ordinal outcomes. Due April 22, 2021*

            Hw10.do - do file used in HW # 10

 

Models for Binary/Proportional Outcomes IV: Special Topics

    Analyzing Rare Events with Logistic Regression. Many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare, e.g. only 20 or 30 people experience the event. This handout describes the problem and discusses various solutions, with an emphasis on Penalized Maximum Likelihood (aka the Firth Method).

            RareEvents.do - Stata program for rare events models

            Strongly Recommended: Analysis of Rare Events, by Heinz Leitgob

 

    Analyzing Proportions / Fractional Response Models. In many cases, the dependent variable of interest is a proportion, i.e. its values range between 0 and 1. Wooldridge (1996, 2011) gives the example of the proportion of employees that participate in a company's pension plan. This handout shows that methods used for binary outcomes can easily be adapted to deal with such variables. Other approaches are also discussed.

            fracmodels.do - Stata program for models for analyzing proportions

            Strongly Recommended: Analysis of Proportions, by Maarten L Buis
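To illustrate the idea that binary-outcome methods carry over to proportions, the sketch below fits a fractional logit model in two equivalent ways. It assumes Stata 14 or later (for fracreg); the data set (webuse 401k, a pension plan participation example) and the predictors are assumptions chosen for illustration and may differ from what the handout uses.

```stata
* Illustrative sketch only; data set and model are assumptions.
webuse 401k, clear

* Fractional logit: prate is a proportion between 0 and 1;
* fracreg reports robust standard errors by default
fracreg logit prate mrate ltotemp age i.sole

* Average marginal effect of the match rate on the proportion
margins, dydx(mrate)

* The same model via glm with a binomial family, logit link, and
* robust standard errors (Stata notes that the dependent variable
* has noninteger values)
glm prate mrate ltotemp age i.sole, family(binomial) link(logit) vce(robust)
```

The glm version makes the connection to ordinary logistic regression explicit: the estimating equations are the same, and only the dependent variable has changed from 0/1 to a proportion.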