
CSE 40657/60657
Homework 4

Due
Mon 2017/11/20 11:55pm
Points
30

In this assignment you will build and improve a named entity recognizer for Tweets. This was the shared task for WNUT 2016. It is a very difficult task: human performance is in the upper 60s, and machine performance ranges from 20% to 50%.

Clone (don't fork) the HW4 repository. It contains the following files:

train Training data
dev Development data
test Test data: don't peek!
conlleval.pl Compute precision/recall
Note: This data comes unfiltered from Twitter and contains a lot of vulgar language.

You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting. You may reuse code from your previous assignments, especially HW2.

Whenever the instructions below say to "report" something, it should be reported in the PDF file that you submit.

1. Hidden Markov Model

  1. The training data has the following format:
    Greek   B-other
    Festival        I-other
    at      O
    St      B-facility
    Johns   I-facility
    before  O
    ASPEN   B-geo-loc
    
    Ew      O
    wait    O
    ...
    
    Blank lines mark sentence boundaries. Each nonblank line contains a word and its corresponding tag: B-type for the first word of a named entity of type type, I-type for a word inside a named entity of type type, and O for a word outside any named entity. Write code to read in the training data. Report how many tokens, word types, and tag types there are.
  2. Implement a Hidden Markov Model and train it on the training data. The states should simply be the set of tags found in the data. You will need to smooth $p(w \mid t)$. We found performance to be highly sensitive to smoothing; we recommend converting all unknown words to a special symbol <unk> and using add-0.1 smoothing. Report your model's tag bigram probabilities for:
    p(B-person | O) =
    p(B-person | B-person) =
    p(I-person | B-person) =
    p(B-person | I-person) =
    p(I-person | I-person) =
    p(O | I-person) =
    
    and your model's (smoothed) tag-word probabilities for:
    p(God | B-person) =
    p(God | O) =
    p(Justin | B-person) =
    p(Justin | O) =
    p(Lindsay | B-person) =
    p(Lindsay | O) =
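Training amounts to counting tag bigrams and tag-word pairs, then normalizing. A sketch, assuming add-0.1 smoothing on emissions with <unk> for unknown words (a fuller implementation would also map rare training words to <unk> so it gets nonzero counts):

```python
from collections import Counter

def train_hmm(sentences, delta=0.1):
    """Estimate p(t'|t) from bigram counts and p(w|t) with add-delta smoothing."""
    trans, emit, ctx, tags = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev, tag] += 1   # tag bigram count
            ctx[prev] += 1          # how often prev appears as a context
            emit[tag, word] += 1    # tag-word count
            tags[tag] += 1
            vocab.add(word)
            prev = tag
        trans[prev, "</s>"] += 1    # end-of-sentence transition
        ctx[prev] += 1

    def p_trans(prev, tag):
        return trans[prev, tag] / ctx[prev]

    def p_emit(tag, word):
        if word not in vocab:
            word = "<unk>"
        v = len(vocab) + 1          # +1 for <unk>
        return (emit[tag, word] + delta) / (tags[tag] + delta * v)

    return p_trans, p_emit, vocab

# toy demonstration on a one-sentence corpus
demo = [[("a", "O"), ("b", "B-person")]]
p_trans, p_emit, vocab = train_hmm(demo)
```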
    
  3. Implement the Viterbi algorithm and run it on the development data. The output format should be:
    Thinking O O
    about O O
    Bowlounge B-facility B-person
    on O O
    Saturday O O
    night O O
    . O O
    
    sur O O
    mon O O
    ...
    
    where the first column contains the words, the second column the correct tags, and the third column your system's tags. Report your output for the first five sentences of the development set.
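The Viterbi recurrence can be sketched in log space as below. This sketch assumes `p_trans` and `p_emit` never return zero (add-0.1 smoothing guarantees this for emissions; you may need to smooth or floor transition probabilities as well):

```python
import math

def viterbi(words, tags, p_trans, p_emit):
    # best[i][t]: log-score of the best tagging of words[:i+1] ending in tag t
    best = [{t: math.log(p_trans("<s>", t)) + math.log(p_emit(t, words[0]))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(p_trans(p, t))) for p in tags),
                key=lambda pair: pair[1])
            best[i][t] = score + math.log(p_emit(t, words[i]))
            back[i][t] = prev
    # fold in the </s> transition, then trace the best path back
    last = max(tags, key=lambda t: best[-1][t] + math.log(p_trans(t, "</s>")))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# toy scores for illustration: tag "B" prefers word "x", tag "O" everything else
tags = ["O", "B"]
def p_trans(prev, t): return 0.5
def p_emit(t, w): return 0.9 if (t == "B") == (w == "x") else 0.1
```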
  4. Run the scorer like this:
    perl conlleval.pl < dev.out
    
    where dev.out is replaced with the name of your output file. You should see something like this:
    processed 16261 tokens with 661 phrases; found: 931 phrases; correct: 77.
    accuracy:  71.99%; precision:   8.27%; recall:  11.65%; FB1:   9.67
              company: precision:  85.71%; recall:  15.38%; FB1:  26.09  7
             facility: precision:  17.65%; recall:   7.89%; FB1:  10.91  17
              geo-loc: precision:  60.00%; recall:  15.52%; FB1:  24.66  30
                movie: precision:   0.00%; recall:   0.00%; FB1:   0.00  5
          musicartist: precision:   0.00%; recall:   0.00%; FB1:   0.00  67
                other: precision:   2.63%; recall:   8.33%; FB1:   3.99  419
               person: precision:  10.05%; recall:  21.64%; FB1:  13.73  368
              product: precision:   7.14%; recall:   2.70%; FB1:   3.92  14
           sportsteam: precision:  33.33%; recall:   1.43%; FB1:   2.74  3
               tvshow: precision:   0.00%; recall:   0.00%; FB1:   0.00  1
    
    The important score is FB1 (balanced F-measure) over all types, which here is 9.67%. Report your system's score on the development data. It should be at least 9%.

2. Structured Perceptron

  1. Implement a structured perceptron. The model is not a probability model, just a scoring function: $$s(\mathbf{w}, \mathbf{t}) = \sum_{i=1}^n \left(s_{\text{t}}(t_{i-1}, t_i) + s_{\text{w}}(t_i, w_i)\right) + s_{\text{t}}(t_n, \texttt{</s>}).$$
  2. Note that if you're using log-probabilities, this is pretty much what your implementation of the Viterbi algorithm already looks like, and most likely no changes are needed. The training algorithm works like this:
    • initialize all the $s_{\text{t}}(t, t')$ and $s_{\text{w}}(t, w)$ to zero
    • for each training sentence $\mathbf{w}, \mathbf{t}$:
      • tag $\mathbf{w}$ to obtain a predicted tag sequence $\mathbf{p} = p_1 \cdots p_n$
      • for $i \leftarrow 1, \ldots, n$:
        • $s_{\text{t}}(t_{i-1}, t_i) \mathrel{\mathord+\mathord=} 1$
        • $s_{\text{w}}(t_i, w_i) \mathrel{\mathord+\mathord=} 1$
        • $s_{\text{t}}(p_{i-1}, p_i) \mathrel{\mathord-\mathord=} 1$
        • $s_{\text{w}}(p_i, w_i) \mathrel{\mathord-\mathord=} 1$
      • $s_{\text{t}}(t_n, \texttt{</s>}) \mathrel{\mathord+\mathord=} 1$
      • $s_{\text{t}}(p_n, \texttt{</s>}) \mathrel{\mathord-\mathord=} 1$
    The intuition is: If we got the sentence right, then the updates cancel each other out. If we got the sentence wrong, then the updates increase the weights we would have used to get it right and decrease the weights we actually used.
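    The update loop above can be sketched as follows (assuming $t_0 = \texttt{<s>}$; `tagger` stands for your Viterbi decoder run with the current weights, and all names are illustrative):

```python
from collections import defaultdict

def perceptron_epoch(sentences, s_t, s_w, tagger):
    """One pass of structured-perceptron updates over the training data."""
    for sent in sentences:
        words = [w for w, t in sent]
        gold = [t for w, t in sent]
        pred = tagger(words, s_t, s_w)
        if pred == gold:
            continue  # correct prediction: updates would cancel out
        prev_g = prev_p = "<s>"
        for i in range(len(words)):
            s_t[prev_g, gold[i]] += 1   # reward the gold transitions/emissions
            s_w[gold[i], words[i]] += 1
            s_t[prev_p, pred[i]] -= 1   # penalize the predicted ones
            s_w[pred[i], words[i]] -= 1
            prev_g, prev_p = gold[i], pred[i]
        s_t[gold[-1], "</s>"] += 1
        s_t[pred[-1], "</s>"] -= 1

# demonstration with a stand-in tagger that always predicts O
s_t = defaultdict(int)
s_w = defaultdict(int)
def always_O(words, s_t, s_w):
    return ["O"] * len(words)

perceptron_epoch([[("Justin", "B-person"), ("sang", "O")]], s_t, s_w, always_O)
```

After this one update, the weight for emitting "Justin" as B-person has gone up and the weight for emitting it as O has gone down, while weights the two taggings share are unchanged.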
  3. Train the perceptron on the training data. After each epoch (pass through the training data), report your accuracy (number of correct tags divided by total number of tags) on both the training data and the development data. When done training, the FB1 score (not accuracy) on the development data should reach at least 15%. Report any other details, such as:
    • How did you decide when to stop?
    • Did you use any tricks like randomly shuffling the training data, or averaging the weights?

3. Features

  1. Now try adding more features to the model. As long as each feature is a function either of $t_{i-1}$ and $t_i$ or of $t_i$ and $w_i$, you should be able to modify your Viterbi algorithm to compute it. You're also welcome to try other modifications besides adding features. In your report, describe each thing you tried and what effect it had on training and development accuracy.
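For instance, emission-side features of $(t_i, w_i)$ could be sketched like this; the particular features (lowercased word, capitalization, @-mentions, hashtags, suffixes) are illustrative choices, not a prescribed set:

```python
def word_features(tag, word):
    """Return the features that fire for a (tag, word) pair."""
    feats = [("word", tag, word.lower())]
    if word[:1].isupper():
        feats.append(("cap", tag))       # capitalized word
    if word.startswith("@"):
        feats.append(("mention", tag))   # Twitter @-mention
    if word.startswith("#"):
        feats.append(("hashtag", tag))
    feats.append(("suffix3", tag, word[-3:].lower()))
    return feats

def score_word(weights, tag, word):
    """Replacement for s_w(t_i, w_i): sum the weights of the firing features."""
    return sum(weights.get(f, 0) for f in word_features(tag, word))
```

During training, the perceptron update then adds or subtracts 1 from each firing feature's weight rather than from a single $(t, w)$ cell.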
  2. Finally, run your best model on the test set and report your FB1 (not accuracy) score. A small part of your grade will depend on how you did relative to the rest of the class.

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: