In this assignment, you will build various text classification models and use them to classify sentences from 2016 presidential debates according to speaker.
Download the Homework 1 data [tgz] [zip]. It contains three files: train, dev, and test. The files have one document per line, already tokenized and lowercased, so you don't have to do any preprocessing. Each line has the format
trump i fully understand .
where the first token is the speaker's name, and the remaining tokens are the words of the document.
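As a possible starting point, reading this format could be sketched like so (the function name `read_docs` is just an illustration, not part of the assignment):

```python
def read_docs(path):
    """Read one document per line; the first token is the speaker,
    the rest are the words of the document."""
    docs = []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip any blank lines
                docs.append((tokens[0], tokens[1:]))
    return docs
```

For the example line above, this yields the pair `("trump", ["i", "fully", "understand", "."])`.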
In the following, the point value of each requirement is written in brackets after it, like this. [30 points]
Implement a Naïve Bayes classifier. For reference, the model is defined as: $$\begin{align} P(k, d) &= p(k) \times \prod_{w \in d} p(w \mid k) \\ P(k \mid d) &= \frac{p(k, d)}{\sum_{k'} p(k', d)} \end{align}$$
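One way to train the model above is by counting, estimating $p(k)$ and $p(w \mid k)$ from relative frequencies. This sketch uses add-one smoothing, which is an assumption on my part (the handout doesn't prescribe a smoothing method), and works in log-space to sidestep underflow:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, words). Returns log p(k) and log p(w|k).
    Add-one smoothing is assumed here; the handout leaves this open."""
    class_count = Counter()
    word_count = defaultdict(Counter)
    vocab = set()
    for k, words in docs:
        class_count[k] += 1
        word_count[k].update(words)
        vocab.update(words)
    n = sum(class_count.values())
    log_prior = {k: math.log(c / n) for k, c in class_count.items()}
    log_cond = {}
    for k in class_count:
        total = sum(word_count[k].values()) + len(vocab)
        log_cond[k] = {w: math.log((word_count[k][w] + 1) / total)
                       for w in vocab}
    return log_prior, log_cond

def score_nb(log_prior, log_cond, words):
    """Return log p(k, d) for each class k; unseen words are skipped."""
    return {k: log_prior[k] + sum(log_cond[k][w]
                                  for w in words if w in log_cond[k])
            for k in log_prior}
```

Predicting a label is then just an argmax over the scores; to get $P(k \mid d)$ you still need to normalize, for which the appendix applies.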
Report your accuracy on dev. [1 point] Note that the probabilities $P(k \mid d)$ should sum to one! If you're running into underflow problems, see the appendix.

Implement logistic regression. For reference, the model is defined as: $$\begin{align} P(k \mid d) &= \frac{\exp s(k,d)}{Z(d)} \\ s(k, d) &= \lambda(k) + \sum_{w \in d} \lambda(k,w) \\ Z(d) &= \sum_k \exp s(k,d). \end{align}$$
The following steps assume that you are using Autograd, although you are free to use another toolkit or no toolkit.
Train on train and report your accuracy on dev. [1 point] After training, report $\lambda(k)$ and $\lambda(k,w)$ for $k \in \{\text{clinton}, \text{trump}\}$ and $w \in \{\text{country}, \text{president}\}$. [1 point] Report your accuracy on dev. [1 point] Run your classifier on test and report your accuracy, [2 points] which should be at least 50%. [3 points] Experiment with (at least) two new kinds of features.
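The model can be trained by gradient ascent on the log-likelihood. The handout suggests Autograd; the sketch below instead uses plain NumPy with a hand-derived gradient so that it has no extra dependency, and all the names in it are illustrative, not prescribed by the assignment:

```python
import numpy as np

def train_lr(docs, epochs=100, lr=0.5):
    """Batch gradient ascent for the model s(k,d) = lambda(k) + sum_w lambda(k,w).
    The bias lambda(k) is folded in as a constant feature."""
    labels = sorted({k for k, _ in docs})
    vocab = sorted({w for _, ws in docs for w in ws})
    kidx = {k: i for i, k in enumerate(labels)}
    widx = {w: i for i, w in enumerate(vocab)}
    K, V = len(labels), len(vocab)
    X = np.zeros((len(docs), V + 1))
    y = np.zeros(len(docs), dtype=int)
    for i, (k, ws) in enumerate(docs):
        X[i, V] = 1.0                      # constant feature for lambda(k)
        for w in ws:
            X[i, widx[w]] += 1.0
        y[i] = kidx[k]
    lam = np.zeros((K, V + 1))
    for _ in range(epochs):
        s = X @ lam.T                      # scores s(k, d)
        s -= s.max(axis=1, keepdims=True)  # shift for numerical stability
        p = np.exp(s)
        p /= p.sum(axis=1, keepdims=True)  # P(k | d)
        onehot = np.eye(K)[y]
        # gradient of the average log-likelihood w.r.t. lambda
        lam += lr * (onehot - p).T @ X / len(docs)
    return lam, labels, widx

def predict_lr(lam, labels, widx, words):
    x = np.zeros(lam.shape[1])
    x[-1] = 1.0
    for w in words:
        if w in widx:
            x[widx[w]] += 1.0
    return labels[int(np.argmax(lam @ x))]
```

With Autograd, the update step would instead come from `autograd.grad` applied to a function computing the negative log-likelihood, but the gradient it returns is the same quantity as the hand-derived one here.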
Please submit all of the above (code and written answers) in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. Please name your file hw1-netid.tgz, replacing netid with your NetID.
In 1c, if you're using log-probabilities, and in 2a, you have to compute something of the form $$\begin{align*} P(k \mid d) &= \frac{\exp f(k,d)}{Z(d)} \\ Z(d) &= \sum_{k'} \exp f(k',d) \end{align*}$$ where $f$ is either $\log p$ or $s$. The problem is that $f$ could be a very negative number like $-1000$, and taking its exp gives 0 because of underflow.
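You can see the problem directly in the interpreter (the scores here are made up for illustration):

```python
import math

f = [-1000.0, -1001.0]            # hypothetical scores f(k, d)
naive = [math.exp(v) for v in f]  # both underflow to exactly 0.0
Z = sum(naive)                    # so Z(d) == 0.0
# P(k | d) = exp(f) / Z would now divide by zero.
```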
Method 1: Use scipy.misc.logsumexp (or autograd.scipy.misc.logsumexp with Autograd; in recent versions of SciPy the function lives in scipy.special instead):
$$\begin{align*}
\DeclareMathOperator*{\logsumexp}{logsumexp}
P(k \mid d) &= \frac{\exp f(k,d)}{\exp \log Z(d)} \\ &= \exp \left(f(k,d) - \log Z(d)\right) \\
&= \exp \left(f(k,d) - \log \sum_{k'} \exp f(k', d) \right) \\
&= \exp \left(f(k,d) - \logsumexp_{k'} f(k', d) \right)
\end{align*}$$
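In code, Method 1 looks like the following (using the modern scipy.special location of the function; the scores are again made up):

```python
import numpy as np
from scipy.special import logsumexp

f = np.array([-1000.0, -1001.0])  # hypothetical scores f(k, d)
log_p = f - logsumexp(f)          # log P(k | d), computed safely
p = np.exp(log_p)                 # ~ [0.731, 0.269]; sums to one
```

The exps here are safe because the subtraction happens in log-space first, so the arguments to exp are moderate even though the raw scores are not.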
Method 2 (which is how logsumexp works under the hood): Increasing or decreasing all the $f(k,d)$ by the same amount doesn't change anything, because
$$\begin{align*}
\frac{\exp (f(k,d) - m)}{\sum_{k'} \exp (f(k', d)-m)} &= \frac{\exp f(k,d) / \exp m}{\sum_{k'} \exp f(k', d) / \exp m} = \frac{\exp f(k,d)}{\sum_{k'} \exp f(k', d)}.
\end{align*}$$
So choose $m = \max_k f(k,d)$, which shifts the $f(k,d)$ so that the biggest one is zero. Then it's safe to take exps of all of them.
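Method 2 is a few lines to implement yourself (the function name is just illustrative):

```python
import math

def softmax_stable(f):
    """Compute exp(f_k) / sum_k' exp(f_k') without underflow,
    by shifting all scores so the largest is zero."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    z = sum(exps)
    return [e / z for e in exps]
```

For scores like $[-1000, -1001]$ this returns approximately $[0.731, 0.269]$, where exponentiating directly would give $0/0$.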