Richard Williams, Notre Dame Sociology

Sociology 73994

Categorical Data Analysis

Richard Williams, Instructor

Fall 2024 [In Progress]


NOTE:  My Stata Highlights page includes links to Stata and statistical handouts from my other courses that may interest readers.

Last Year's Notes: Click here if you want to see the complete online notes and handouts from the last time the course was taught. I will be making some changes this year, but the old notes should be fine for anyone who wants to get a head start on methods we haven't gotten to yet. Links on this page will become "live" once the handout is ready.

Stata is in the labs. You can also order your own personal copy of Stata through the GradPlan package. Stata Student Pricing is especially cheap. Stata 17 is now out and I encourage you to get it if you want to work off campus. Alternatively, you can use the Online Virtual Labs. If you need something more powerful, you can check out How to Use Stata on the Notre Dame Center for Research Computing (CRC) Machines. These use Unix and can be accessed remotely over the web. I'd much rather have you use the online virtual machines or the CRC machines than some hand-me-down old version of Stata that has bugs that were fixed years ago or that lacks important new features.

 

NOTE: The following special types of files are used on this web page. Some materials are available only to nd.edu users.

PDF  Pdf files. Require Adobe Acrobat.  Get Acrobat Reader

  Stata files.

Useful sites for learning about Stata and SPSS

Rich Williams' Stata Highlights Page

UCLA's Statistical Computing Resources 
RW Suggestions for Using Stata at Notre Dame 

UCLA's Stata Resources

RW's Suggested downloads

UCLA - How does Stata compare with SAS and SPSS?
Resources for learning Stata
Wisconsin SSCC Articles on Statistical Computing
The Stata User Support Page
Ben Jann's estout/esttab support page (esttab & estout are great for formatting output from Stata)
Stata YouTube Video Tutorials
Stata FAQS

 

Links to major topics on this page

[Overview] [Review of Continuous Models] [GLMs, MLE] [Basics of logistic regression]

[Adjusted Predictions and Marginal Effects] [Complex Survey Designs] [Missing Data]

[Ordinal Outcomes I]  [Multinomial Outcomes] [Ordinal Independent Variables]

[Intermediate Logistic Regression] [Nested Models]

[Panel Data/ Multilevel Models]

[Group Comparisons/Hetero Choice] [gologit models]

[Count Outcomes] [Rare Events] [Fractional Response Models]

 

Overview

This course discusses methods and models for the analysis of categorical dependent variables and their applications in social science research. Researchers are often interested in the determinants of categorical outcomes. For example, such outcomes might be binary (lives/dies), ordinal (very likely/ somewhat likely/ not likely), nominal (taking the bus, car, or train to work) or count (the number of times something has happened, such as the number of articles written). When dependent variables are categorical rather than continuous, conventional OLS regression techniques are not appropriate. This course therefore discusses the wide array of methods that are available for examining categorical outcomes.

PDF  Syllabus

Regression Models for Categorical Dependent Variables Using Stata (required text; as far as I know it is cheapest to order directly from the Stata Bookstore)

Fixed Effects Regression Models, by Paul Allison. (required text; order it from Amazon or elsewhere). Notre Dame users can also access an e-version from our library.

Stata Files Stata data files used in the course will usually be stored in this folder.

Recommended Reading (ND.Edu Netid is required for access)

Cloud Storage. I strongly urge you to use something like Dropbox, Google Drive, Microsoft OneDrive, Box, or a similar service to back up your most critical files. With so many free and inexpensive services now available, there is really no excuse for losing more than a day's work, if that. I am very fond of Dropbox but just about anything will do. For more options see Notre Dame Collaboration and File Storage.

Useful Research Resources for Notre Dame Students. ND students can get free/low cost newspaper subscriptions and free online access to millions of research articles. I find these resources incredibly helpful to me in my work. Be sure to check them out!

Basic Dataset Exploration & File Preparation. You should get to know your dataset before you start to analyze it. This handout includes suggestions for using dataset documentation; documenting your own work; using fre instead of tab1 for frequencies; including options with tab2; coding missing data correctly; converting string variables to numeric variables; creating binary or ordinal variables from continuous measures; handling more complicated data structures; and other useful resources. The PDF file is bookmarked so you can easily skip to the sections you are interested in. This handout may be especially helpful when doing Homework #1. Here is the do file if you want to replicate its analysis.

A Note on the Treatment of Gender and Race in My Statistics Notes. [DRAFT] Early versions of some of my statistics handouts first saw life more than 30 years ago. There have been many changes in statistical practice and preferred wording over those 30 years. My notes (and for that matter probably most of my publications) reflect common past practices. As I update handouts, I am making changes to reflect current, more inclusive terminology. The handout also lists several resources for those who are interested in studying often-overlooked gender identities such as transgender and non-binary.

Brief Review of Models for Continuous Outcomes (Review this on your own and come to me with questions as needed)

PDF  Review of Multiple Regression

reg01.dta - Data file used in the Stata Regression handout

PDF  Using Stata for OLS Regression  (If you are interested, click here for a similar handout using SPSS)

I. Foundations of categorical data analysis.

This section will go over the basics of logistic regression. It will also go over techniques for making results more interpretable; analyzing data sets with complex sampling schemes; and (possibly) techniques for handling missing data. I call these topics "foundations" because once you understand them it is very easy to extend them to other CDA methods.

Overview of Generalized Linear Models, Maximum Likelihood Estimation

    Introduction to Generalized Linear Models

            L01.do - Stata program for GLM handout

    Maximum Likelihood Estimation & Troubleshooting

            L02.do - Stata program for MLE handout

    Assignment 1: Preliminary Data Analysis & Setup. Due Sept 5. The above handout on Basic Dataset Exploration may be extremely helpful.

Models for Binomial Outcomes: Basics of Logistic Regression

Logistic Regression I: Problems with the Linear Probability Model (LPM)

 logit01.do - Stata file(s) used in the logistic regression 1 handout

Logistic Regression II: The Logistic Regression Model (LRM)

 logit02.do - Stata file(s) used in the logistic regression 2 handout

Logistic Regression III: Hypothesis Testing, Comparisons with OLS

 logit03.do - Stata file(s) used in the logistic regression 3 handout

Measures of fit - Pseudo R^2, BIC, AIC (Read on your own; I may just discuss a few highlights)

            L05.do - Stata program for measures of fit

Using Stata for Logistic Regression (be sure to read this on your own, as it covers important details we may not go over in class)

 logistic-stata.do - Stata file(s) used in the using stata for logistic regression handout

 logist.dta - Stata data file used in the Logistic Regression handouts

(Optional) Paul Allison further elaborates on the merits of the logit model. However, Paul von Hippel maintains the LPM often isn't so bad. Allison responds back. Marc Bellemare, Christopher Zorn, Dave Giles, and Marc Bellemare (again) further debate the issue. My own feeling is that the best of both worlds is to use logit and then couple it with marginal effects.

    Assignment 2: Basics of Logistic Regression. Due Sept 19

            cruzmake.do - If you are curious, this shows how the fake data were created

            Hw02.do - do file used for the results presented in HW # 2

            cruz.dta - data file used in HW # 2

 

Interpreting results: Adjusted Predictions and Marginal effects

The results from binomial and ordinal models can often be difficult to interpret. All too often, researchers discuss the sign and statistical significance of results but say little about their substantive significance. I will expect every student paper to use the methods described in this section and/or one of the advanced methods we discuss later in the course. I will also expect students to be familiar with some of Long & Freese's spost13 commands, such as mtable and mchange. I will probably go over the first handout fairly carefully and then skim the rest, since they are mostly examples you can go over on your own.

Margins01 and Margins03 will definitely be covered in class. The others you should read on your own and ask questions if you have them.
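As a rough illustration of what adjusted predictions and marginal effects are, the following Python sketch computes them by hand for a logit model. Everything in it is an assumption made for illustration: the coefficients, the variable names (group, age), and the six-case sample are invented, and the course handouts use Stata's margins command rather than code like this. The logic mirrors averaging over the observed values of the other variables.

```python
import math

# Hypothetical logit coefficients and a tiny made-up sample; nothing here
# comes from the course handouts or data files.
B_CONS, B_GROUP, B_AGE = -4.0, 0.7, 0.05
SAMPLE = [  # (group, age)
    (0, 25), (0, 40), (1, 35), (1, 60), (0, 70), (1, 50),
]


def invlogit(xb):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-xb))


def predict(group, age):
    """Predicted probability for one case."""
    return invlogit(B_CONS + B_GROUP * group + B_AGE * age)


def average_adjusted_prediction(group_value):
    """Set group to group_value for every case, keep each case's own age,
    and average the predicted probabilities."""
    preds = [predict(group_value, age) for _, age in SAMPLE]
    return sum(preds) / len(preds)


def ame_group():
    """Average marginal effect of the binary variable: the difference
    between the two average adjusted predictions."""
    return average_adjusted_prediction(1) - average_adjusted_prediction(0)


def ame_age():
    """Average marginal effect of age: the mean of the derivative
    b * p * (1 - p), evaluated at each case's observed values."""
    effects = [B_AGE * predict(g, a) * (1.0 - predict(g, a)) for g, a in SAMPLE]
    return sum(effects) / len(effects)


if __name__ == "__main__":
    print(f"AAP, group=0: {average_adjusted_prediction(0):.4f}")
    print(f"AAP, group=1: {average_adjusted_prediction(1):.4f}")
    print(f"AME of group: {ame_group():.4f}")
    print(f"AME of age:   {ame_age():.4f}")
```

The point of the sketch is that these quantities are on the probability scale, which is usually easier to discuss substantively than the sign and significance of a logit coefficient.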

    Using Stata's Margins Command to Estimate and Interpret Adjusted Predictions and Marginal Effects (click here for the Powerpoint version).

            Margins01.do - Stata program for margins #1 handout

Also - the Stata Journal article I wrote on this is available for free. For an application of the margins command, see my 2013 article with Lutz Bornmann entitled How to calculate the practical significance of citation impact differences? The do file and data file required to replicate the paper's analysis are here.

   Marginal Effects for Continuous Variables. We have mostly been talking about adjusted prediction and marginal effects for categorical variables. This handout explains how marginal effects for continuous variables are estimated and interpreted, and includes additional technical details for those who want them. Read this on your own if I don't cover it in class.

            Margins02.do - Stata program for margins #2 handout

   Understanding & Interpreting the Effects of Continuous Variables: The MCP Command. We have mostly been talking about adjusted prediction and marginal effects for categorical variables. This handout explains how marginal effects for continuous variables are estimated and interpreted, and includes additional technical details for those who want them.

            Margins03.do - Stata program for margins #3 handout

   Using the spost13 commands for adjusted predictions and marginal effects with binary dependent variables. This is pretty easy so I may just have you read this on your own.

            Margins04.do - Stata program for margins #4 handout

   Adjusted predictions and marginal effects for multiple outcome commands and models. Commands like ologit, oprobit, mlogit, oglm, and gologit2 estimate models where there are multiple outcomes, e.g. the dependent variable is ordinal and has 5 categories. The same margins and spost13 commands can generally be used with each. This handout explains how. I am putting this handout here so it is with all the other margins handouts but you will probably want to read it after we've covered commands like ologit and mlogit.

            Margins05.do - Stata program for margins #5 handout

  Model Coefficients, Adjusted Predictions, & Marginal Effects: A Summary of How All Three are Related. This handout gives a brief summary of major points that have been covered in this section.

            Margins06.do - Stata program for margins #6 handout

    Assignment 3: Interpreting results: Adjusted Predictions and Marginal Effects. Due Sept 26

            Hw03.do - do file used for the results presented in HW # 3

            cruz.dta - data file used in HW # 3

Categorical Data Analysis with Complex Survey Designs

By default, most statistical techniques assume that data were collected via simple random sampling. This is often not true for large national data sets. Fortunately, Stata makes it easy to analyze such data, but there are some important differences in how you go about testing hypotheses and assessing model fit.

  Analyzing Complex Survey Data: Some key issues to be aware of. I've consolidated what had been a few separate handouts. Key issues for both OLS regression and categorical data analysis are discussed.

            svy01.do - Stata program for svy analysis handout

        Also worth seeing:    UCLA's (see lower third of page) and StataCorp's FAQS on Survey Data Analysis. Also Chuck Huber has an excellent video on Specifying the design of your survey data in Stata. (But if your dataset is as complicated as the one in his example, you'd better hope that the study documentation is really clear on how to do things.)

Missing data

I am mostly covering this here because it is an important topic and there wasn't enough time to cover it in the new Stats I! But, several of the methods do involve the use of categorical data analysis, so it isn't totally out of place. Also, since you are analyzing your own data this semester, you are more likely to have to deal with missing data.

Coding Missing Data. This two-page handout shows how you can use commands like mvdecode to make sure missing data is being handled correctly in Stata.

 MDCoding.do - Stata file used in the MDCoding handout

 MDCoding.dta - Stata data file used in the MDCoding handout

Missing Data Part 1: Overview, Traditional Methods. I'll talk about parts of this in the class but you should read the whole thing on your own. It mostly explains why most traditional methods for handling missing data (other than listwise deletion) are seriously flawed.

Missing Data Part 2: Multiple Imputation & Maximum Likelihood. This is a very long handout! But it has been reworked so only the first few pages are required. You should read more of the handout if you actually want to use Multiple Imputation in your research.

mdpart2.do - Stata file for the MD Part 2 handout

Also worth seeing: The Wisconsin Social Science Computing Cooperative has some great pages on MI. Also possibly helpful are the StataCorp FAQs on MI and this page from UCLA.

    Assignment 4: Complex Survey Designs; Multiple Imputation. Due Oct 3

            Hw04.do - do file used for the results presented in HW # 4

 


II. Intermediate CDA Methods

Here we will talk about other commonly used CDA methods, including ordinal regression and models for multinomial outcomes.

Models for Ordinal Outcomes I: The ordered logit and interval regression models

CDA Students: We will work through the Williams and Quiroz paper on Ordinal Regression Models. pp. 1-18 (The ordered logit/ proportional odds model) and pp. 23-26 (interval regression) are required. ND.Edu users can access the entire paper here. If the paper is unclear, you may also want to look at the older handouts listed below -- they cover less territory but provide more of a blow by blow description of what is being done. The do file required to replicate most of the paper's analysis is sageord.do.

For those not at ND: If your library has purchased Sage Research Methods Foundations (and if it hasn't it should!) the entry can be found at https://methods.sagepub.com/Foundations/ordinal-regression-models. [If you want the paper and can't otherwise access it, you can email me at rwilliam@nd.edu for a copy.]

Also Available to Anyone: While I think the Williams/Quiroz paper is better, older handouts and do files that were incorporated into the paper are available at ologit01, intreg2, ologit1.do, and intreg2.do

Everyone: After reading any of the above you will also want to look at the Margins05 handout if you haven't already.

(Optional): It doesn't have much to do with statistics, but here is the true story of the engineer who tried to save Challenger.

 

Models for Multinomial Outcomes

When categories are unordered, Multinomial Logistic regression is one often-used strategy. We will discuss several ways to aid in the interpretation and testing of these models.
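For readers who want to see how multinomial logit coefficients translate into predicted probabilities, here is a small Python sketch. It borrows the bus/car/train example from the Overview, but the coefficients and the single predictor (income) are hypothetical and chosen only for illustration. One category serves as the base outcome with its coefficients fixed at zero.

```python
import math

# Hypothetical coefficients for a 3-category outcome (bus, car, train).
# "bus" is the base category, so its coefficients are fixed at zero.
COEFS = {
    "bus":   {"cons": 0.0, "income": 0.0},
    "car":   {"cons": -1.0, "income": 0.04},
    "train": {"cons": -0.5, "income": 0.01},
}


def mlogit_probs(income):
    """Return the predicted probability of each category at a given income.

    Each probability is exp(xb_k) divided by the sum of exp(xb_j) over all
    categories, so the probabilities always sum to 1.
    """
    xb = {k: b["cons"] + b["income"] * income for k, b in COEFS.items()}
    denom = sum(math.exp(v) for v in xb.values())
    return {k: math.exp(v) / denom for k, v in xb.items()}


if __name__ == "__main__":
    for income in (0, 50, 100):
        probs = mlogit_probs(income)
        summary = "  ".join(f"{k}={p:.3f}" for k, p in probs.items())
        print(f"income={income:>3}: {summary}")
```

Because a change in one predictor shifts the probabilities of all categories at once, a single coefficient can be hard to read on its own, which is why the handouts lean on adjusted predictions and marginal effects.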

    Multinomial Logit - Overview

            mlogit1.do - Stata program for mlogit, including adjusted predictions & marginal effects

    Other Post-Estimation Commands for mlogit. I'll probably let you go over most or all of this on your own.

            mlogit2.do - Stata program for other mlogit post-estimation commands

 

Ordinal Independent Variables

We often want to use ordinal variables as independent/explanatory variables in our models. Rightly or wrongly, it is very common to treat such variables as continuous. The readings discuss when it is appropriate to do so. They also discuss other possible strategies that can be employed with ordinal independent variables.

CDA Students: We will work through the short Williams paper on Ordinal Independent Variables.

For those not at ND: If your library has purchased Sage Research Methods Foundations (and if it hasn't it should!) the entry can be found at https://methods.sagepub.com/Foundations/ordinal-independent-variables . [If you want the paper and can't otherwise access it, you can email me at rwilliam@nd.edu for a copy.]

Also Available to Anyone: While I think the Williams paper is better, the older handout and do files that were incorporated into the paper are available at OrdinalIndependent.pdf  and OrdinalIndependentVars.do. The handout includes a few things (e.g. Sheaf Coefficients) I decided not to include in the paper.

 

    Assignment 5: Ordinal and Multinomial Models. Due Oct 10

            Hw05.do - do file used in HW # 5

    Assignment 6: Paper Proposal - Due Oct 18

 

Models for Binary Outcomes II: Intermediate Logistic Regression

    The Latent Variable Model for Binary Regression

            L03.do - Stata program for Latent variable handout

    Standardized Coefficients in Logistic Regression

            L04.do - Stata program for standardized coefficients

    Alternatives to logistic regression

Models for Binary Outcomes III: Comparing logit and probit coefficients across nested models

    Prelude to Comparing Logit & Probit Coefficients Between Nested Models

            Nested01.do - Stata program for Prelude to comparing coefficients across nested models

CDA Students: I am going to show you my Sage video!!! We will also work through the short Williams & Jorgensen paper that the video is based on, Comparing logit & probit coefficients between nested models.

For those not at ND: If your library has purchased Sage Research Methods Videos (and if it hasn't it should!) the entry can be found at https://methods.sagepub.com/video/comparing-logit-and-probit-coefficients-between-nested-models . The paper is at https://www.sciencedirect.com/science/article/pii/S0049089X22001132  [If you want the paper and can't otherwise access it, you can email me at rwilliam@nd.edu for a copy.]

Replication Materials: Everything needed to replicate the analysis in the paper is contained in SSR2023Appendix. Also, you can download the following files into a directory where Stata can find them: hypothetical.do, nested.ado, and Obama2008.do.

Also Available to Anyone: The earlier handouts that were eventually incorporated into the published paper are Comparing Logit & Probit Coefficients Between Models  (click here for Powerpoint version) and Comparing Logit & Probit Coefficients Between Nested Models (Extended Version). (The latter is actually an older version of the handout but it includes several additional points that might be helpful.)

Assignment 7: Intermediate issues in logistic regression analysis. Due Oct 31

            Hw07.do - do file used in HW # 7

 


 

III. Advanced Topics (Subject to Change or Re-Ordering)

Panel Data/ Multilevel Models

Sometimes the same individuals (or nations, or companies) are measured at multiple points in time. The statistical technique used needs to reflect the fact that the different measurements are not independent of each other. This is a big topic and goes well beyond Categorical Data Analysis, but a few basic commands, e.g. xtlogit, will be discussed.

NOTE: I AM ACTUALLY GOING TO USE SOME OF THE NOTES FOR A MINI-COURSE I TAUGHT DURING SUMMER 2018 IN TAIWAN.  THE COMPLETE SET OF NOTES FROM THAT CLASS IS HERE.

    Introduction (The course outline lists all the topics covered in Taiwan. We are only doing the first few.)

    Setting up the data

            datasetup.do

    Fixed effects and conditional logit models

            fixedeffects.do

    Fixed effects versus random effects models

            fixedvsrandom.do

    Basic Multilevel models

            multilevel.do

Also recommended:

    http://www.statalist.org/forums/forum/general-stata-discussion/general/1341778-how-do-xtlogit-and-melogit-differ

    http://stats.idre.ucla.edu/stata/dae/mixed-effects-logistic-regression/ (NOTE: xtmelogit has been superseded by melogit)

Assignment 8: Panel Data Methods. Due Nov 14

            Hw08.do - do file used in HW # 8

 

NOTE: I've covered enough to give you some basic competency with Panel and Multilevel Models, but the Taiwan page has a lot more if you want it. Hybrid models are a way of estimating both fixed and random effects in the same model (albeit with some limitations). You can do adjusted predictions and marginal effects with random effects models. This far-from-finished presentation and handout show an application of many multilevel model methods, including random slopes models. You can do panel data linear models too. Sometimes you are interested not in whether an event occurs but in how quickly it occurs (e.g. what factors make people die more quickly?). In such cases, Discrete Time Methods for Event History Analysis can sometimes be a good way to go.

 

Models for Ordinal Outcomes II: Heterogeneous Choice Models and Other Methods for Comparing Logit & Probit Coefficients Across Groups

To the surprise of many, techniques used for group comparisons in OLS regression (e.g. adding interaction effects) can be highly problematic in logistic and ordinal regression. As Hoetker notes, "in the presence of even fairly small differences in residual variation, naive comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists." We will discuss how heterogeneous choice models and possibly other methods offer possible solutions. The handout covers only the most critical points; for those who want to know more, the article by Allison, the two articles by Williams, and the article by Mize, Doan and Long that are mentioned in the handout are highly recommended.

    Comparing Logit and Probit Coefficients Across Groups: Problems, Solutions, and Problems with the Solutions (also available in Powerpoint).

    Handout for Comparing Logit & Probit Coefficients Across Groups 

 

Models for Ordinal Outcomes III: Generalized ordered logit models

The assumptions of the ordered logit model are often violated. The generalized ordered logit model (estimated by gologit2) sometimes provides a viable but still parsimonious alternative.
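For orientation, a common way to write the generalized ordered logit model for an outcome with M ordered categories is sketched below. The symbols are assumptions made for this summary (alpha for the cutpoint-specific intercepts, beta for the coefficients); check the 2006 Stata Journal article for the exact parameterization and sign conventions, which can differ from those of other ordinal commands.

```latex
% Generalized ordered logit: a separate coefficient vector for each cutpoint j
P(Y_i > j) = \frac{\exp(\alpha_j + X_i \beta_j)}{1 + \exp(\alpha_j + X_i \beta_j)},
\qquad j = 1, 2, \ldots, M - 1

% Ordered logit (proportional odds) as the special case with a common beta
P(Y_i > j) = \frac{\exp(\alpha_j + X_i \beta)}{1 + \exp(\alpha_j + X_i \beta)},
\qquad j = 1, 2, \ldots, M - 1
```

In this notation the ordered logit model constrains every coefficient to be the same across all M - 1 cutpoints, the fully generalized model frees all of them, and a partial proportional odds model frees only some. That middle ground is what lets the model relax a violated assumption while remaining reasonably parsimonious.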

    GOLOGIT Part 1: The gologit model & gologit2 program (Powerpoint version)

    GOLOGIT Part 2: Interpretation of results. (Powerpoint version) - Also get this handout

    Updates to gologit2: This describes major updates to the program since it was released in 2006.

For more detail, you should read the 2006 Stata Journal article that introduced the program and the 2016 Journal of Mathematical Sociology article on how to interpret results. The JMS reading requires an nd.edu account to access; others can find it described at http://www.tandfonline.com/doi/full/10.1080/0022250X.2015.1112384.

With both of the above, the handout on Adjusted predictions and marginal effects for multiple outcome commands and models is again very helpful.

Assignment 9: Advanced issues/ models for ordinal outcomes. Due Nov 21

            Hw09.do - do file used in HW # 9

 

Models for Count Outcomes

Variables that count the # of times something happens are common in the Social Sciences. For example, Long examined the # of publications by scientists. Count variables are often treated as though they are continuous and the linear regression model is applied; but this can result in inefficient, inconsistent and biased estimates. In this section we will examine some of the many models that deal explicitly with count outcomes.
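As a small illustration of a model that deals explicitly with counts, the Python sketch below works through the arithmetic of a Poisson regression. The coefficients are hypothetical and the single predictor x is invented; the handout itself covers several count models in Stata, and this sketch shows only the simplest one.

```python
import math

# Hypothetical Poisson regression coefficients (illustration only).
B_CONS, B_X = 0.2, 0.3


def expected_count(x):
    """Expected count: mu = exp(xb), which can never be negative."""
    return math.exp(B_CONS + B_X * x)


def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def incidence_rate_ratio():
    """Factor by which the expected count changes for a one-unit
    increase in x: exp(b)."""
    return math.exp(B_X)


if __name__ == "__main__":
    mu = expected_count(2)
    print(f"Expected count at x=2: {mu:.3f}")
    print(f"Incidence rate ratio for x: {incidence_rate_ratio():.3f}")
    for k in range(5):
        print(f"P(Y={k} | x=2) = {poisson_pmf(k, mu):.3f}")
```

Unlike a linear regression, the predicted mean here is always positive and the model assigns probabilities only to the non-negative integers.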

    Count Models

            count01.do - Stata program for count models

Assignment 10: Count Models - Due Nov 28.

            Hw10.do - do file used in HW # 10

 

 

IV. Special Topics (Time Permitting)

Models for Binary/Proportional Outcomes IV

    Analyzing Rare Events with Logistic Regression. Many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare, e.g. only 20 or 30 people experience the event. This handout describes the problem and discusses various solutions, with an emphasis on Penalized Maximum Likelihood (aka the Firth Method).

            RareEvents.do - Stata program for rare events models

            Strongly Recommended: Analysis of Rare Events, by Heinz Leitgob. Also available at https://methods.sagepub.com/foundations/analysis-of-rare-events.

 

    Analyzing Proportions / Fractional Response Models. In many cases, the dependent variable of interest is a proportion, i.e. its values range between 0 and 1. Wooldridge (1996, 2011) gives the example of the proportion of employees that participate in a company's pension plan. This handout shows that methods used for binary outcomes can easily be adapted to deal with such variables. Other approaches are also discussed.

            fracmodels.do - Stata program for models for analyzing proportions

            Strongly Recommended: Analysis of Proportions, by Maarten L Buis. Also available at https://methods.sagepub.com/foundations/analysis-of-proportions.