NOTE: The following special types of files are used on this web page. Some materials are available only to nd.edu users.
PDF files require Adobe Acrobat.
Useful sites for learning about Stata and SPSS
UCLA's Statistical Computing Resources
RW Suggestions for Using Stata at Notre Dame
UCLA's SPSS Starter Kit
Resources for learning Stata
UCLA - How does Stata compare with SAS and SPSS?
The Stata User Support Page
Ben Jann's estout/esttab support page (esttab & estout are great for formatting output from Stata)
Overview. This course discusses methods and models for the analysis of categorical dependent variables and their applications in social science research. Researchers are often interested in the determinants of categorical outcomes. For example, such outcomes might be binary (lives/dies), ordinal (very likely/somewhat likely/not likely), nominal (taking the bus, car, or train to work), or count (the number of times something has happened, such as the number of articles written). When dependent variables are categorical rather than continuous, conventional OLS regression techniques are not appropriate. This course therefore discusses the wide array of methods that are available for examining categorical outcomes.
Regression Models for Categorical Dependent Variables Using Stata (required text; as far as I know it is cheapest to order directly from the Stata Bookstore)
Fixed Effects Regression Models, by Paul Allison. (required text; order it from Amazon or elsewhere)
Stata Files Stata data files used in the course will usually be stored in this folder.
Recommended Reading (ND.Edu Netid is required for access)
Dropbox. I strongly encourage you to set up a Dropbox account if you do not already have one. Dropbox gives you a minimum of 2GB of free online storage. More critically, with Dropbox you can set up shared folders. This makes it much easier when you want me or others to help you with your research. You can create a folder, put your data and programs in it, and then share the folder with me. If you set up an account use your .edu email address because you can get more bonus storage that way. For more click on this link. If necessary, things like Google Drive or Box can also possibly be used, but I find Dropbox easiest.
Brief Review of Models for Continuous Outcomes (Review this on your own and come to me with questions as needed)
Review of Multiple Regression
reg01.dta - Data file used in the Stata Regression handout
Using Stata for OLS Regression (If you are interested, click here for a similar handout using SPSS)
I. Foundations of categorical data analysis.
This section will go over the basics of logistic regression. It will also go over techniques for making results more interpretable; analyzing data sets with complex sampling schemes; and (possibly) techniques for handling missing data. I call these topics "foundations" because once you understand them it is very easy to extend them to other CDA methods.
Overview of Generalized Linear Models, Maximum Likelihood Estimation
Introduction to Generalized Linear Models
L01.do - Stata program for GLM handout
Maximum Likelihood Estimation (Read on your own)
L02.do - Stata program for MLE handout
Assignment 1: Preliminary Data Analysis & Setup. Due January 26.
Models for Binomial Outcomes: Basics of Logistic Regression
Logistic Regression I: Problems with the Linear Probability Model (LPM)
logit01.do - Stata file(s) used in the logistic regression 1 handout
Logistic Regression II: The Logistic Regression Model (LRM)
logit02.do - Stata file(s) used in the logistic regression 2 handout
Logistic Regression III: Hypothesis Testing, Comparisons with OLS
logit03.do - Stata file(s) used in the logistic regression 3 handout
Using Stata for Logistic Regression (be sure to read this on your own, as it covers important details we may not go over in class)
logistic-stata.do - Stata file(s) used in the using stata for logistic regression handout
logist.dta - Stata data file used in the Logistic Regression handouts
Measures of fit - Pseudo R^2, BIC, AIC (Read on your own; I may just discuss a few highlights)
L05.do - Stata program for measures of fit
(Optional) Paul Allison further elaborates on the merits of the logit model. However, Paul von Hippel maintains the LPM often isn't so bad. Marc Bellemare, Christopher Zorn, Dave Giles, and Marc Bellemare (again) further debate the issue. My own feeling is that the best of both worlds is to use logit and then couple it with marginal effects.
Assignment 2: Basics of Logistic Regression. Due February 2.
cruzmake.do - If you are curious, this shows how the fake data were created
Hw02.do - do file used for the results presented in HW # 2
cruz.dta - data file used in HW # 2
Interpreting results: Adjusted Predictions and Marginal effects
The results from binomial and ordinal models can often be difficult to interpret. All too often, researchers discuss the sign and statistical significance of results but say little about their substantive significance. I will expect every student paper to use the methods described in this section and/or one of the advanced methods we discuss later in the course. I will also expect students to be familiar with some of Long & Freese's spost13 commands, such as mtable and mchange.
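As a rough illustration of the kind of output expected, the sketch below fits a logit model and then asks margins for average marginal effects and adjusted predictions. It uses Stata's built-in auto data set, which is an assumption made purely for illustration and is not one of the course files. The spost13 commands (mtable, mchange) are not shown because they must be installed separately.

```stata
* Illustrative data only: the auto data set that ships with Stata
sysuse auto, clear

* Logit model for whether a car is foreign
logit foreign weight mpg

* Average marginal effects of each predictor on Pr(foreign = 1)
margins, dydx(*)

* Adjusted predictions at selected weights, with mpg left at observed values
margins, at(weight=(2000(500)4000))
```

The first margins call reports effects on the probability scale, which is usually easier to discuss substantively than the raw logit coefficients.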
Using Stata's Margins Command to Estimate and Interpret Adjusted Predictions and Marginal Effects (click here for the Powerpoint version).
Margins01.do - Stata program for margins #1 handout
Also - the Stata Journal article I wrote on this is available for free. For an application of the margins command, see my 2013 article with Lutz Bornmann entitled How to calculate the practical significance of citation impact differences? Also available here. The do file and data file required to replicate the paper's analysis are here.
Marginal Effects for Continuous Variables. We have mostly been talking about adjusted prediction and marginal effects for categorical variables. This handout explains how marginal effects for continuous variables are estimated and interpreted, and includes additional technical details for those who want them.
Margins02.do - Stata program for margins #2 handout
Understanding & Interpreting the Effects of Continuous Variables: The MCP Command.
Margins03.do - Stata program for margins #3 handout
Using the spost13 commands for adjusted predictions and marginal effects with binary dependent variables. This is pretty easy so I may just have you read this on your own.
Margins04.do - Stata program for margins #4 handout
Adjusted predictions and marginal effects for multiple outcome commands and models. Commands like ologit, oprobit, mlogit, oglm, and gologit2 estimate models where there are multiple outcomes, e.g. the dependent variable is ordinal and has 5 categories. The same margins and spost13 commands can generally be used with each. This handout explains how.
Margins05.do - Stata program for margins #5 handout
Assignment 3: Interpreting results: Adjusted Predictions and Marginal effects. Due February 9.
Hw03.do - do file used for the results presented in HW # 3
cruz.dta - data file used in HW # 3
Categorical Data Analysis with Complex Survey Designs
By default, most statistical techniques assume that data were collected via simple random sampling. This is often not true for large national data sets. Fortunately, Stata makes it easy to analyze such data, but there are some important differences in how you go about testing hypotheses and assessing model fit.
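A minimal sketch of the workflow, assuming the nhanes2d example data from Stata's documentation (not a course file), which comes with its survey design already declared:

```stata
* Illustrative data only: NHANES II extract from the Stata documentation
webuse nhanes2d, clear

* Show the stored design (sampling weights, strata, PSUs)
svyset

* The svy: prefix applies the design to the logistic regression
svy: logistic highbp height weight age female

* Joint hypotheses are tested with an adjusted Wald test;
* likelihood-ratio tests are not available after svy estimation
test height weight
```

For your own data you would first declare the design with svyset, using whatever weight, strata, and PSU variables the data provider documents.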
Analyzing Complex Survey Data: Some key issues to be aware of. I've consolidated what had been a few separate handouts. Key issues for both OLS regression and categorical data analysis are discussed.
svy01.do - Stata program for svy analysis handout
Also worth seeing: UCLA's (see lower third of page) and StataCorp's FAQs on Survey Data Analysis
I am mostly covering this here because it is an important topic and there wasn't enough time to cover it in the new Stats I! But, several of the methods do involve the use of categorical data analysis, so it isn't totally out of place. This will be review for those of you who have had me before.
Missing Data Part 1: Overview, Traditional Methods. Read on your own. It mostly explains why most traditional methods for handling missing data (other than listwise deletion) are seriously flawed. I won't talk about this in class.
mdpart1.do, md.dta - Stata files used in the Missing Data Part 1 handout & in the homework
Missing Data Part 2: Multiple Imputation & Maximum Likelihood
mdpart2.do - Stata file for the MD Part 2 handout
Also worth seeing: The Wisconsin Social Science Computing Cooperative has some great pages on MI. Also possibly helpful are the StataCorp FAQs on MI and this page from UCLA.
Assignment 4: Complex Survey Designs; Multiple Imputation. Due February 16.
Hw04.do - do file used for the results presented in HW # 4
II. Intermediate CDA Methods
Here we will talk about other commonly used CDA methods, including ordinal regression, models for multinomial outcomes, and models for count outcomes.
Models for Ordinal Outcomes I: The ordered logit and interval regression models
Ordinal Logit Models: Basics
ologit1.do - Stata program for ologit overview
(Optional) It doesn't have much to do with statistics, but here is the true story of the engineer who tried to save Challenger.
intreg2.do - Stata program for interval regression
Models for Multinomial Outcomes
When categories are unordered, Multinomial Logistic regression is one often-used strategy. We will discuss several ways to aid in the interpretation and testing of these models.
Multinomial Logit - Overview
mlogit1.do - Stata program for mlogit, including adjusted predictions & marginal effects
Other Post-Estimation Commands for mlogit
mlogit2.do - Stata program for other mlogit post-estimation commands
Assignment 5: Ordinal and Multinomial Models. Due March 2.
Hw05.do - do file used in HW # 5
Models for Count Outcomes
Variables that count the # of times something happens are common in the Social Sciences. For example, Long examined the # of publications by scientists. Count variables are often treated as though they are continuous and the linear regression model is applied; but this can result in inefficient, inconsistent and biased estimates. In this section we will examine some of the many models that deal explicitly with count outcomes.
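As a small sketch of two common starting points, the example below fits a Poisson model and then a negative binomial model to the same counts. The rod93 data set comes from Stata's documentation and is assumed here only for illustration; it is not one of the course files.

```stata
* Illustrative data only: death counts by cohort from the Stata documentation
webuse rod93, clear

* Poisson regression, with exposure time entering as an offset
poisson deaths i.cohort, exposure(exposure)

* Negative binomial regression allows for overdispersion;
* its output includes a likelihood-ratio test of alpha = 0
nbreg deaths i.cohort, exposure(exposure)
```

If the test of alpha = 0 is rejected, the Poisson assumption of equal mean and variance is doubtful and the negative binomial results are generally preferred.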
count01.do - Stata program for count models
Assignment 6: Paper Proposal - Due March 9. Send copies to both the instructor and the TA
Assignment 7: Count Models - Due March 23
Hw07.do - do file used in HW # 7
III. Advanced Topics (Subject to Change or Re-Ordering)
Models for Binary Outcomes II: Intermediate Logistic Regression
The Latent Variable Model for Binary Regression
L03.do - Stata program for Latent variable handout
Standardized Coefficients in Logistic Regression
L04.do - Stata program for standardized coefficients
Alternatives to logistic regression
Models for Binary Outcomes III: Comparing logit and probit coefficients across nested models
Prelude to Comparing Logit & Probit Coefficients Between Nested Models
Nested01.do - Stata program for Prelude to comparing coefficients across nested models
Comparing Logit & Probit Coefficients Between Models (click here for Powerpoint version)
Handout for Comparing Logit & Probit Coefficients Between Models
Comparing Logit & Probit Coefficients Between Nested Models (Extended Version). OPTIONAL. This is actually an older version of the handout but it includes several additional points that might be helpful.
Assignment 8: Intermediate issues in logistic regression analysis. Due March 30
Hw08.do - do file used in HW # 8
Models for Ordinal Outcomes II: Generalized ordered logit models
The assumptions of the ordered logit model are often violated. The generalized ordered logit model (estimated by gologit2) sometimes provides a viable but still parsimonious alternative.
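A hedged sketch of the comparison, assuming the nhanes2f example data from Stata's documentation (not a course file) and assuming gologit2 has already been installed, since it is a user-written command:

```stata
* gologit2 is user-written; install it once with: ssc install gologit2
* Illustrative data only: NHANES II extract from the Stata documentation
webuse nhanes2f, clear

* Ordered logit: a single coefficient per predictor (parallel lines assumed)
ologit health female black age

* Generalized ordered logit; autofit relaxes the parallel-lines constraint
* only for variables that appear to violate it
gologit2 health female black age, autofit
```

With autofit, variables that pass the parallel-lines test keep a single coefficient, which is what keeps the resulting model relatively parsimonious.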
GOLOGIT Part 1: The gologit model & gologit2 program (Powerpoint version)
GOLOGIT Part 2: Interpretation of results. (Powerpoint version) - Also get this handout
Updates to gologit2: This describes major updates to the program since it was released in 2006.
For more detail, you should read the 2006 Stata Journal article that introduced the program and the 2016 Journal of Mathematical Sociology article on how to interpret results. The JMS reading requires an nd.edu account to access; others can find it described at http://www.tandfonline.com/doi/full/10.1080/0022250X.2015.1112384.
Models for Ordinal Outcomes III: Heterogeneous Choice Models and Other Methods for Comparing Logit & Probit Coefficients Across Groups
To the surprise of many, techniques used for group comparisons in OLS regression (e.g. adding interaction effects) can be highly problematic in logistic and ordinal regression. As Hoetker notes, "in the presence of even fairly small differences in residual variation, naive comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists." We will discuss how heterogeneous choice models, and possibly other methods, offer potential solutions. The handout covers only the most critical points; for those who want to know more, the article by Allison and the two articles by Williams that are mentioned in the references are highly recommended.
Comparing Logit and Probit Coefficients Across Groups: Problems, Solutions, and Problems with the Solutions (also available in Powerpoint).
Handout for Comparing Logit & Probit Coefficients Across Groups
Assignment 9: Advanced issues/ models for ordinal outcomes. Due April 6
Hw09.do - do file used in HW # 9
Panel Data/ Multilevel Models
Sometimes the same individuals (or nations, or companies) are measured at multiple points in time. The statistical technique used needs to reflect the fact that the different measurements are not independent of each other. This is a big topic and goes well beyond Categorical Data Analysis, but a few basic commands, e.g. xtlogit, will be discussed.
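For orientation, here is a minimal sketch of xtlogit, assuming the union example data from Stata's documentation (not a course file), in which women are observed in multiple years:

```stata
* Illustrative data only: panel of young women from the Stata documentation
webuse union, clear

* Declare the panel structure: person identifier and time variable
xtset idcode year

* Random-effects logit for union membership
xtlogit union age grade i.not_smsa south, re

* Fixed-effects (conditional) logit; people whose outcome never
* changes over time drop out of the estimation
xtlogit union age grade i.not_smsa south, fe
```

The two estimators make different assumptions about unobserved person-level differences, which is why the choice between them gets its own handout.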
Panel Data 0: Brief Overview of Linear Models. Read this on your own. It primarily focuses on techniques for panel data with continuous variables. Some key points are repeated in the handouts below.
panel00.do - Stata program for brief overview of panel data linear models
Panel Data 1: Discrete-Time Methods for the Analysis of Event Histories (Read this on your own too. I will backtrack to it if time permits but I think the other topics are more important.) Often we are interested not only in whether an event occurs, but how quickly it happens. Drawing on work from Allison, this handout shows how panel data and basic logistic regression techniques can sometimes be used for such purposes.
Panel Data 2: Setting up Panel data
panel02.do - Stata program for setting up panel data
Panel Data 3: Conditional Logit/ Fixed Effects Logit Models
panel03.do - Stata program for clogit/ fixed effects
Panel Data 4: Fixed Effects vs Random Effects Models
panel04.do - Stata program for fixed effects vs random effects models
Multilevel Models [DRAFT -- I want to eventually explain some points in more depth and consider more complicated models]
multilevel.do - Stata program for multilevel models
Also recommended (at least until I come up with a more complete handout):
http://stats.idre.ucla.edu/stata/dae/mixed-effects-logistic-regression/ (NOTE: xtmelogit is now called melogit)
Assignment 10: Panel Data Methods. Due April 20
Hw10.do - do file used in HW # 10
Models for Binary/Proportional Outcomes IV: Special Topics
Analyzing Rare Events with Logistic Regression. Many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare, e.g. only 20 or 30 people experience the event. This handout describes the problem and discusses various solutions, with an emphasis on Penalized Maximum Likelihood (aka the Firth Method).
RareEvents.do - Stata program for rare events models
Analyzing Proportions / Fractional Response Models. In many cases, the dependent variable of interest is a proportion, i.e. its values range between 0 and 1. Wooldridge (1996, 2011) gives the example of the proportion of employees that participate in a company's pension plan. This handout shows that methods used for binary outcomes can easily be adapted to deal with such variables. Other approaches are also discussed.
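A brief sketch of two such approaches, assuming the 401k example data from Stata's documentation (not a course file) and Stata 14 or later for fracreg:

```stata
* Illustrative data only: pension plan participation rates by firm
webuse 401k, clear

* Fractional logit: the outcome prate is a proportion between 0 and 1
fracreg logit prate mrate c.ltotemp##c.ltotemp c.age##c.age i.sole

* Average marginal effect of the match rate on the expected proportion
margins, dydx(mrate)

* A similar model via glm with a logit link and robust standard errors
glm prate mrate c.ltotemp##c.ltotemp c.age##c.age i.sole, ///
    family(binomial) link(logit) vce(robust)
```

The glm version shows how closely the approach follows ordinary logistic regression: the link function is the same, and only the treatment of the outcome and the standard errors changes.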
fracmodels - Stata program for models for analyzing proportions
Ordinal Independent Variables. We often want to use ordinal variables as independent/explanatory variables in our models. Rightly or wrongly, it is very common to treat such variables as continuous. This handout discusses when it is appropriate to do so. The handout also discusses other possible strategies that can be employed with ordinal independent variables, such as the use of Sheaf coefficients.
OrdinalIndependentVars.do - do file used in the Ordinal Independent Variables handout