
Bill McDonald
Professor of Finance

Thomas A. and James J. Bruder Chair in Administrative Leadership


E‑Mail:

mcdonald.1@nd.edu

Address:

335 Mendoza College of Business

University of Notre Dame

Notre Dame, IN  46556

Telephone:

(574) 631‑5137

 

 

Textual Analysis

This page contains tools that are useful for textual analysis in financial applications, along with data from some of the text-related publications I have written with Tim Loughran.  The essential method of textual analysis goes by various labels in other disciplines, such as content analysis, natural language processing, information retrieval, or computational linguistics.  A growing literature finds significant relations between stock price reactions and the sentiment of information releases, as measured by word classifications such as those provided below.

 

If you would like to receive e-mail notifications of updates, please send me an e-mail and I’ll put you on the update listserv.  We are currently building a website consolidating software and data for textual analysis in accounting and finance (see http://sraf.nd.edu).  The data compilations on this website are provided for use by individual researchers.  For commercial licenses, please contact us at mcdonald.1@nd.edu.

 

Loughran and McDonald Sentiment Word Lists

 

Note:  We thank Cam Harvey and others who suggested some of the modifications we’ve included in these lists. The word lists are described in:

Tim Loughran and Bill McDonald, 2011, “When is a Liability not a Liability?  Textual Analysis, Dictionaries, and 10-Ks,” Journal of Finance, 66:1, 35-65.

 

and

 

Andriy Bodnaruk, Tim Loughran and Bill McDonald, 2015, “Using 10-K Text to Gauge Financial Constraints,” Journal of Financial and Quantitative Analysis, 50:4.

 
All word lists are contained in the Master Dictionary described immediately below.  Each row in the Master Dictionary spreadsheet is a word.  Sentiment word lists are identified by column with members of the given set identified by non-zero entries.  The non-zero entries represent the year in which the word was added to a given sentiment list.
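
As a rough illustration of the layout described above, the sketch below reads a small comma-separated excerpt and collects the members of each sentiment list from the non-zero entries.  The column names (Word, Negative, Positive, Uncertainty), the sample rows, and the years are all invented for the example and are not taken from the actual Master Dictionary file, whose layout may differ.

```python
import csv
import io

# Invented excerpt for illustration only; the real Master Dictionary's
# column names, words, and years may differ.
SAMPLE = """Word,Negative,Positive,Uncertainty
ABANDON,2009,0,0
ACHIEVE,0,2009,0
APPROXIMATE,0,0,2009
LIABILITY,0,0,0
MISSTATE,2011,0,0
"""

def load_sentiment_lists(handle, categories):
    """Return {category: {word: year_added}} built from non-zero entries."""
    lists = {category: {} for category in categories}
    for row in csv.DictReader(handle):
        for category in categories:
            year = int(row[category])
            if year != 0:  # a non-zero entry marks membership in the list
                lists[category][row["Word"]] = year
    return lists

sentiment = load_sentiment_lists(
    io.StringIO(SAMPLE), ["Negative", "Positive", "Uncertainty"]
)
print(sorted(sentiment["Negative"]))
```

Keeping the year as the dictionary value makes it possible to reconstruct a list as it stood in an earlier release by filtering on the year added.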

 

For WordStat users:  WordStat .cat and .NFO files

 

Master Dictionary

 

2014 Master Dictionary  (click to download)

Updated: March 2015

·         Derived from release 4.0 of 2of12inf.  Extended to include words appearing in 10-K documents that are not found in the original 2of12inf word list.  In addition to providing a master word list, the dictionary includes statistics for word frequencies in all 10-Ks from 1994-2014 (including 10-X variants).  The dictionary reports counts, proportion of total, average proportion per document, standard deviation of proportion per document, document count (i.e., number of documents containing at least one occurrence of the word), nine sentiment category identifiers (e.g., negative, positive, uncertainty, litigious, modal, constraining), Harvard Word List identifier, number of syllables, and source for each word.  Detailed documentation appears here.

 

 

IPO Data

 

Tim Loughran and Bill McDonald, 2013, “IPO First-Day Returns, Offer Price Revisions, Volatility, and Form S-1 Language,” Journal of Financial Economics, 109:2, 307-326.

 

Aggregate word list based on the union of negative, uncertainty and weak modal words:

·          Loughran_McDonald_AggregateIPOWordList.txt
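
A union of word lists like the one described above can be sketched in a few lines.  The three short lists here are placeholders chosen for the example, not the actual negative, uncertainty, and weak modal lists, and the function name is hypothetical.

```python
def aggregate_word_list(*word_lists):
    """Return the sorted union of several word lists, normalized to upper case."""
    combined = set()
    for words in word_lists:
        combined.update(word.strip().upper() for word in words if word.strip())
    return sorted(combined)

# Placeholder lists for illustration; they are not the published word lists.
negative = ["loss", "adverse", "doubt"]
uncertainty = ["doubt", "may", "approximately"]
weak_modal = ["may", "could", "possibly"]

aggregate = aggregate_word_list(negative, uncertainty, weak_modal)
print(aggregate)
```

Because the result is a set union, a word that appears in more than one source list is counted only once in the aggregate list.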

Data:

·          Sample of completed IPOs in STATA format

·          Sample of withdrawn IPOs in STATA format

 

Readability – 10-K File Size Data

 

Tim Loughran and Bill McDonald, 2014, “Measuring Readability in Financial Disclosures,” Journal of Finance, 69:4, 1643-1671.

 

See the 10-K file summaries below, which contain both gross and net file sizes by filing date, CIK, and form type.

 

 

All SEC EDGAR Filings by Type and Year

 

1993-2015 SEC Filings by Type/Year  Updated: January 2016

·         Excel file containing counts of all SEC filings by form and by year.  Derived from the SEC’s master.idx files. (download)
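
For readers who want to produce similar counts themselves, the following is a minimal sketch and not the procedure used to build the Excel file.  It assumes each data row of a master.idx file is pipe-delimited with five fields (CIK, company name, form type, date filed, file name) and that the date begins with a four-digit year.  The sample rows, company names, and file names are placeholders.

```python
from collections import Counter

# Placeholder index excerpt; real master.idx files also carry a longer header.
SAMPLE_INDEX = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000001|EXAMPLE CO A|10-K|2014-03-01|edgar/data/1000001/placeholder-a.txt
1000002|EXAMPLE CO B|10-Q|2014-05-10|edgar/data/1000002/placeholder-b.txt
1000001|EXAMPLE CO A|10-K|2015-03-02|edgar/data/1000001/placeholder-c.txt
1000003|EXAMPLE CO C|10-K|2014-03-15|edgar/data/1000003/placeholder-d.txt
"""

def count_filings(lines):
    """Count filings by (form type, year) from pipe-delimited index lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Data rows have five fields and a numeric CIK; skip everything else.
        if len(fields) != 5 or not fields[0].strip().isdigit():
            continue
        form_type = fields[2].strip()
        year = int(fields[3].strip()[:4])
        counts[(form_type, year)] += 1
    return counts

filing_counts = count_filings(SAMPLE_INDEX.splitlines())
for (form_type, year), n in sorted(filing_counts.items()):
    print(form_type, year, n)
```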



 

 

 

 

 

10-X File Summaries

 

·          1994-2014 10-X Summaries

 
A 131 MB file containing summary data for all 10-X filings (i.e., 10-K variants and related forms such as 10-K405, 10‑Q, and 10‑KSB) for 1994-2014.  In addition to word counts for each of the Loughran/McDonald dictionaries, it contains the filing date, fiscal year end (fye), form type, file name, SIC code, Fama/French industry (48), total number of words, total number of unique words (i.e., words used one or more times), gross file size, net file size (after pre-parsing for tables, HTML, etc.), number of ASCII-encoded characters, number of HTML characters, number of XBRL characters, and number of table characters.

o    Documentation for Stage One Parse
This document describes the process that strips the 10-X files down to text files.

o    Documentation of Master Dictionary and Document Dictionaries
This document describes the Stage Two parsing process which parses the stage one files into word counts and file attributes.  The Master Dictionary is also created using the Stage Two parsing process.

·         1994-2014 Document Dictionaries

o    See documentation above.

o    A 10 GB file with a tabulated summary of word counts and document statistics for all 10-X type filings.
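
To give a feel for the kind of per-document figures these files report (total words, unique words, and counts for each word list), here is a simplified sketch.  It is not the Stage One or Stage Two parser described in the documentation above: the tokenizer is an assumed regular expression that keeps alphabetic strings only, and the word lists and sample sentence are placeholders.

```python
import re

def summarize_document(text, word_lists):
    """Return total words, unique words, and a count for each word list."""
    # Assumed tokenizer: runs of letters, upper-cased; digits and symbols are dropped.
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    summary = {"total_words": len(tokens), "unique_words": len(set(tokens))}
    for name, words in word_lists.items():
        summary[name] = sum(1 for token in tokens if token in words)
    return summary

# Placeholder word lists and text for illustration only.
word_lists = {
    "negative": {"LOSS", "IMPAIRMENT"},
    "uncertainty": {"MAY", "APPROXIMATELY"},
}
text = "The company may record a loss. The loss may be approximately $5 million."

doc_summary = summarize_document(text, word_lists)
print(doc_summary)
```

Counts produced this way depend heavily on the tokenizer and on how tables and markup are stripped beforehand, so they should not be expected to match the figures in the summary files.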

 

Plain English Data

 

1994-2009 CIK, 10-K filing date, and Plain English measure

·          A file containing our Plain English measure for 10‑K filings as detailed in:  Tim Loughran and Bill McDonald, “Regulation and Financial Disclosure:  The Impact of Plain English,” Journal of Regulatory Economics, 45:1, 94-113.

 

FMA 2012 Tutorial Session Slides

 

·          Natural Language Processing and Textual Analysis in Finance and Accounting

 

Stop Word Lists

 

·         Stop words

1.        Generic

2.        Names

3.        Dates and numbers

4.        Geographic

5.        Currencies


 

 

 

 

 

© 2013 University of Notre Dame