
CSE 40657/60657
Homework 1

Due: Week of 2016/09/12
Points: 30

In this assignment, you will build various text classification models and use them to classify sentences from the 2016 presidential debates according to speaker.

Setup

Download the Homework 1 data [tgz] [zip]. It contains three files: train, dev, and test. Each file has one document per line, tokenized and lowercased, so you don't have to do any preprocessing. Each line has the format

trump i fully understand .
where the first token is the speaker's name, and the remaining tokens are the words of the document.

In the following, point values are written in brackets after each requirement, like this. [30]

1. Naïve Bayes Classifier

Implement a Naïve Bayes classifier. For reference, the model is defined as: $$\begin{align} P(k, d) &= p(k) \times \prod_{w \in d} p(w \mid k) \\ P(k \mid d) &= \frac{P(k, d)}{\sum_{k'} P(k', d)} \end{align}$$
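For concreteness, here is a minimal sketch of the second equation, assuming the probabilities $p(k)$ and $p(w \mid k)$ have already been estimated and stored in dicts named prior and cond (hypothetical names); it works in log space to sidestep the underflow issue discussed in the appendix.

    import math

    def posterior(prior, cond, words):
        # log P(k, d) = log p(k) + sum over w of log p(w | k)
        logjoint = {}
        for k in prior:
            lp = math.log(prior[k])
            for w in words:
                lp += math.log(cond[k][w])   # assumes smoothed probabilities and a known word
            logjoint[k] = lp
        # P(k | d) = P(k, d) / sum over k' of P(k', d), via the max-shift trick
        m = max(logjoint.values())
        total = sum(math.exp(v - m) for v in logjoint.values())
        return {k: math.exp(v - m) / total for k, v in logjoint.items()}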

  1. Write code to read in the training documents and collect counts $c(k)$ (number of documents in class $k$) and $c(k, w)$ (number of occurrences of $w$ in documents of class $k$) for all speakers $k$ and all words $w$. Report $c(k)$ and $c(k, w)$ for $k \in \{\text{clinton}, \text{trump}\}$ and $w \in \{\text{country}, \text{president}\}$. [1]
  2. Write code to compute the probabilities $p(k)$ and $p(w \mid k)$ for all $k,w$ (a training sketch appears after this list). Train on the training set. Report $p(k)$ and $p(w \mid k)$ for $k \in \{\text{clinton}, \text{trump}\}$ and $w \in \{\text{country}, \text{president}\}$. [1]
  3. Write code to read in a test document and compute the probabilities $P(k \mid d)$ for each $k$. Report these probabilities for the first line of dev. [1] Note that these probabilities should sum to one! If you're running into underflow problems, see the appendix.
  4. Run the classifier on the test set and report your accuracy [2], which should be at least 50%. [3]
  5. Describe any implementation choices you had to make (e.g., smoothing, log-probabilities). [2]
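For steps 1 and 2, a minimal training sketch under the data format described above; add-one smoothing is shown as one possible choice, not a required one.

    from collections import defaultdict

    def train_nb(path, alpha=1.0):
        c_k = defaultdict(int)                        # c(k): documents per speaker
        c_kw = defaultdict(lambda: defaultdict(int))  # c(k, w): word counts per speaker
        vocab = set()
        with open(path) as f:
            for line in f:
                k, *words = line.split()              # first token is the speaker
                c_k[k] += 1
                for w in words:
                    c_kw[k][w] += 1
                    vocab.add(w)
        n_docs = sum(c_k.values())
        prior = {k: c_k[k] / n_docs for k in c_k}     # p(k)
        cond = {}
        for k in c_k:
            denom = sum(c_kw[k].values()) + alpha * len(vocab)
            cond[k] = {w: (c_kw[k][w] + alpha) / denom for w in vocab}  # p(w | k)
        return prior, cond

Words that appear only in dev or test still need a decision (for example, skip them or reserve smoothing mass for an unknown-word type); that decision belongs in your answer to step 5.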

2. Logistic Regression

Implement logistic regression. For reference, the model is defined as: $$\begin{align} P(k \mid d) &= \frac{\exp s(k,d)}{Z(d)} \\ s(k, d) &= \lambda(k) + \sum_{w \in d} \lambda(k,w) \\ Z(d) &= \sum_k \exp s(k,d). \end{align}$$
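A minimal sketch of these three equations, assuming (purely for illustration) that the model is a NumPy bias vector holding the $\lambda(k)$ values and a weight matrix holding the $\lambda(k,w)$ values, and that counts is the bag-of-words count vector for $d$:

    import numpy as np
    from scipy.special import logsumexp   # see the appendix

    def predict_lr(bias, weights, counts):
        s = bias + weights.dot(counts)    # s(k, d) = lambda(k) + sum over w in d of lambda(k, w)
        return np.exp(s - logsumexp(s))   # P(k | d) = exp s(k, d) / Z(d), computed stably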

The following steps assume that you are using Autograd, although you are free to use another toolkit or no toolkit.

  1. Write a function that takes three arguments:
    • The model: it can be (for example) a NumPy array, or a list/tuple/dict of arrays. It can also be a list/tuple/dict of floats, though this might be slow.
    • A document (list of words).
    • The correct class (a candidate name).
    The function should compute $-\log P(k \mid d)$. Hint: to compute $\log Z(d)$ without underflow/overflow, see the appendix.
  2. Write code to train the model by stochastic gradient descent, and run the trainer on the whole training set. At each iteration, report the negative log-probability of train and the accuracy on dev. [1] After training, report $\lambda(k)$ and $\lambda(k,w)$ for $k \in \{\text{clinton}, \text{trump}\}$ and $w \in \{\text{country}, \text{president}\}$. [1] (A sketch of the loss function and the SGD loop appears after this list.)
  3. Write code to read in a test document and compute the probabilities $P(k \mid d)$ for each $k$. Report these probabilities for the first line of dev. [1]
  4. Run the classifier on test and report your accuracy [2], which should be at least 50%. [3]
  5. Describe any implementation choices you had to make (e.g., random shuffling of examples, learning rate, number of iterations). [2]
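For steps 1 and 2, a minimal sketch of the loss function and one epoch of SGD with Autograd; the model layout, the (counts, class index) example format, and the learning rate are all illustrative assumptions, not requirements.

    import autograd.numpy as np
    from autograd import grad
    from autograd.scipy.misc import logsumexp   # as suggested in the appendix

    def neg_log_prob(model, counts, k_index):
        # -log P(k | d) = log Z(d) - s(k, d)
        s = model['bias'] + np.dot(model['weights'], counts)
        return logsumexp(s) - s[k_index]

    def sgd_epoch(model, examples, learning_rate=0.1):
        # examples: list of (bag-of-words count vector, correct class index) pairs
        g = grad(neg_log_prob)                      # gradient with respect to model
        for counts, k_index in examples:
            grads = g(model, counts, k_index)
            for name in model:
                model[name] = model[name] - learning_rate * grads[name]
        return model

Shuffling the examples each epoch and tuning the learning rate and number of epochs (step 5) usually make a noticeable difference.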

3. Features

Experiment with (at least) two new kinds of features.

  1. Extend the code you wrote to construct a bag of words to include an additional kind of feature besides words. You can try bigrams, prefixes, parts of speech, anything you like (a small example appears after this list). Describe your new features. [2] Report your new accuracy for both naïve Bayes [1] and logistic regression [1]. Briefly write down your conclusions from this experiment. [1]
  2. Do the same thing for another kind of feature. [5]
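As one illustration of item 1 (bigrams are just one of the suggested options), a hypothetical feature extractor that augments the bag of words:

    def bag_of_features(words):
        feats = list(words)   # the original bag-of-words features
        # bigram features, prefixed so they cannot collide with unigram features
        feats += ['BI:' + words[i] + '_' + words[i + 1] for i in range(len(words) - 1)]
        return feats

For naïve Bayes these features can be counted exactly like words; for logistic regression they simply become additional $\lambda(k,w)$ parameters.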

Submission

Please submit your code and your written responses to the requirements above in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. Please name your file hw1-netid.tgz.

Appendix: Avoiding underflow

In Part 1c (if you're using log-probabilities) and in Part 2a, you have to compute something of the form $$\begin{align*} P(k \mid d) &= \frac{\exp f(k,d)}{Z(d)} \\ Z(d) &= \sum_{k'} \exp f(k',d). \end{align*}$$ where $f$ is either $\log P$ or $s$. The problem is that $f$ could be a very negative number like $-1000$, and taking its exp gives 0 because of underflow.

Method 1: Use scipy.special.logsumexp (called scipy.misc.logsumexp in older SciPy versions, or autograd.scipy.misc.logsumexp with Autograd): $$\begin{align*} \DeclareMathOperator*{\logsumexp}{logsumexp} P(k \mid d) &= \frac{\exp f(k,d)}{\exp \log Z(d)} \\ &= \exp \left(f(k,d) - \log Z(d)\right) \\ &= \exp \left(f(k,d) - \log \sum_{k'} \exp f(k', d) \right) \\ &= \exp \left(f(k,d) - \logsumexp_{k'} f(k', d) \right) \end{align*}$$
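A quick illustration with made-up log-scores:

    import numpy as np
    from scipy.special import logsumexp   # scipy.misc.logsumexp in older SciPy releases

    f = np.array([-1000.0, -1001.0, -1002.5])   # hypothetical f(k, d) values
    posterior = np.exp(f - logsumexp(f))        # P(k | d) for each k, no underflow
    print(posterior, posterior.sum())           # the probabilities sum to 1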

Method 2 (which is how logsumexp works under the hood): Increasing or decreasing all the $f(k,d)$ by the same amount doesn't change anything, because $$\begin{align*} \frac{\exp (f(k,d) - m)}{\sum_{k'} \exp (f(k', d)-m)} &= \frac{\exp f(k,d) / \exp m}{\sum_{k'} \exp f(k', d) / \exp m} = \frac{\exp f(k,d)}{\sum_{k'} \exp f(k', d)}. \end{align*}$$ So choose $m = \max_k f(k,d)$, which shifts the $f(k,d)$ so that the biggest one is zero. Then it's safe to take the exp of all of them.
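The same trick written out by hand, on a hypothetical two-class example:

    import math

    def normalize_logs(f):
        m = max(f.values())                                 # m = max over k of f(k, d)
        exps = {k: math.exp(v - m) for k, v in f.items()}   # largest exponent is exp(0) = 1
        Z = sum(exps.values())
        return {k: e / Z for k, e in exps.items()}

    print(normalize_logs({'clinton': -1000.0, 'trump': -1001.0}))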