
CSE 40657/60657
Homework 4

Due
Mon 2017/11/20 11:55pm
Points
30

In this assignment you will build and improve a named entity recognizer for Tweets. This was the shared task for WNUT 2016. It is a very difficult task: human performance is in the upper 60s, and machine performance ranges from 20% to 50%.

Clone (don't fork) the HW4 repository. It contains the following files:

train Training data
dev Development data
test Test data: don't peek!
conlleval.pl Compute precision/recall
Note: This data comes unfiltered from Twitter and contains a lot of vulgar language.

You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting. You may reuse code from your previous assignments, especially HW2.

Whenever the instructions below say to "report" something, it should be reported in the PDF file that you submit.

1. Hidden Markov Model

  1. The training data has the following format:
    Greek   B-other
    Festival        I-other
    at      O
    St      B-facility
    Johns   I-facility
    before  O
    ASPEN   B-geo-loc
    
    Ew      O
    wait    O
    ...
    
    Blank lines mark sentence boundaries. Each nonblank line contains a word and its corresponding tag: B-type for the first word of a named entity of type type, I-type for a word inside a named entity of type type, and O for a word outside any named entity. Write code to read in the training data. Report how many tokens, word types, and tag types there are.
  2. Implement a Hidden Markov Model and train it on the training data. The states should simply be the set of tags found in the data. You will need to smooth $p(w \mid t)$. We found performance to be highly sensitive to smoothing; we recommend converting all unknown words to a special symbol <unk> and using add-0.1 smoothing. Report your model's tag bigram probabilities for:
    p(B-person | O) =
    p(B-person | B-person) =
    p(I-person | B-person) =
    p(B-person | I-person) =
    p(I-person | I-person) =
    p(O | I-person) =
    
    and your model's (smoothed) tag-word probabilities for:
    p(God | B-person) =
    p(God | O) =
    p(Justin | B-person) =
    p(Justin | O) =
    p(Lindsay | B-person) =
    p(Lindsay | O) =
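Training amounts to counting tag bigrams and tag-word pairs, then normalizing. A sketch, assuming add-0.1 smoothing on emissions with <unk> for unknown words (a fuller implementation would also map rare training words to <unk> so it gets nonzero counts):

```python
from collections import Counter

def train_hmm(sentences, delta=0.1):
    """Estimate p(t'|t) from bigram counts and p(w|t) with add-delta smoothing."""
    trans, emit, ctx, tags = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev, tag] += 1   # tag bigram count
            ctx[prev] += 1          # how often prev appears as a context
            emit[tag, word] += 1    # tag-word count
            tags[tag] += 1
            vocab.add(word)
            prev = tag
        trans[prev, "</s>"] += 1    # end-of-sentence transition
        ctx[prev] += 1

    def p_trans(prev, tag):
        return trans[prev, tag] / ctx[prev]

    def p_emit(tag, word):
        if word not in vocab:
            word = "<unk>"
        v = len(vocab) + 1          # +1 for <unk>
        return (emit[tag, word] + delta) / (tags[tag] + delta * v)

    return p_trans, p_emit, vocab

# toy demonstration on a one-sentence corpus
demo = [[("a", "O"), ("b", "B-person")]]
p_trans, p_emit, vocab = train_hmm(demo)
```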
    
  3. Implement the Viterbi algorithm and run it on the development data. The output format should be:
    Thinking O O
    about O O
    Bowlounge B-facility B-person
    on O O
    Saturday O O
    night O O
    . O O
    
    sur O O
    mon O O
    ...
    
    where the first column contains the words, the second column the correct tags, and the third column your system's tags. Report your output for the first five sentences of the development set.
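The Viterbi recurrence can be sketched in log space as below. This sketch assumes `p_trans` and `p_emit` never return zero (add-0.1 smoothing guarantees this for emissions; you may need to smooth or floor transition probabilities as well):

```python
import math

def viterbi(words, tags, p_trans, p_emit):
    # best[i][t]: log-score of the best tagging of words[:i+1] ending in tag t
    best = [{t: math.log(p_trans("<s>", t)) + math.log(p_emit(t, words[0]))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(p_trans(p, t))) for p in tags),
                key=lambda pair: pair[1])
            best[i][t] = score + math.log(p_emit(t, words[i]))
            back[i][t] = prev
    # fold in the </s> transition, then trace the best path back
    last = max(tags, key=lambda t: best[-1][t] + math.log(p_trans(t, "</s>")))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# toy scores for illustration: tag "B" prefers word "x", tag "O" everything else
tags = ["O", "B"]
def p_trans(prev, t): return 0.5
def p_emit(t, w): return 0.9 if (t == "B") == (w == "x") else 0.1
```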
  4. Run the scorer like this:
    perl conlleval.pl < dev.out
    
    where dev.out is replaced with the name of your output file. You should see something like this:
    processed 16261 tokens with 661 phrases; found: 931 phrases; correct: 77.
    accuracy:  71.99%; precision:   8.27%; recall:  11.65%; FB1:   9.67
              company: precision:  85.71%; recall:  15.38%; FB1:  26.09  7
             facility: precision:  17.65%; recall:   7.89%; FB1:  10.91  17
              geo-loc: precision:  60.00%; recall:  15.52%; FB1:  24.66  30
                movie: precision:   0.00%; recall:   0.00%; FB1:   0.00  5
          musicartist: precision:   0.00%; recall:   0.00%; FB1:   0.00  67
                other: precision:   2.63%; recall:   8.33%; FB1:   3.99  419
               person: precision:  10.05%; recall:  21.64%; FB1:  13.73  368
              product: precision:   7.14%; recall:   2.70%; FB1:   3.92  14
           sportsteam: precision:  33.33%; recall:   1.43%; FB1:   2.74  3
               tvshow: precision:   0.00%; recall:   0.00%; FB1:   0.00  1
    
    The important score is FB1 (balanced F-measure) over all types, which here is 9.67%. Report your system's score on the development data. It should be at least 9%.

2. Structured Perceptron

  1. Implement a structured perceptron. The model is not a probability model, just a scoring function: $$s(\mathbf{w}, \mathbf{t}) = \sum_{i=1}^n \left(s_{\text{t}}(t_{i-1}, t_i) + s_{\text{w}}(t_i, w_i)\right) + s_{\text{t}}(t_n, \texttt{</s>}).$$
  2. Note that if you're using log-probabilities, this is pretty much what your implementation of the Viterbi algorithm already looks like, and most likely no changes are needed. The training algorithm works like this:
    • initialize all the $s_{\text{t}}(t, t')$ and $s_{\text{w}}(t, w)$ to zero
    • for each training sentence $\mathbf{w}, \mathbf{t}$:
      • tag $\mathbf{w}$ to obtain a predicted tag sequence $\mathbf{p} = p_1 \cdots p_n$
      • for $i \leftarrow 1, \ldots, n$:
        • $s_{\text{t}}(t_{i-1}, t_i) \mathrel{\mathord+\mathord=} 1$
        • $s_{\text{w}}(t_i, w_i) \mathrel{\mathord+\mathord=} 1$
        • $s_{\text{t}}(p_{i-1}, p_i) \mathrel{\mathord-\mathord=} 1$
        • $s_{\text{w}}(p_i, w_i) \mathrel{\mathord-\mathord=} 1$
      • $s_{\text{t}}(t_n, \texttt{</s>}) \mathrel{\mathord+\mathord=} 1$
      • $s_{\text{t}}(p_n, \texttt{</s>}) \mathrel{\mathord-\mathord=} 1$
    The intuition is: If we got the sentence right, then the updates cancel each other out. If we got the sentence wrong, then the updates increase the weights we would have used to get it right and decrease the weights we actually used.
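    The update loop above can be sketched as follows (assuming $t_0 = \texttt{<s>}$; `tagger` stands for your Viterbi decoder run with the current weights, and all names are illustrative):

```python
from collections import defaultdict

def perceptron_epoch(sentences, s_t, s_w, tagger):
    """One pass of structured-perceptron updates over the training data."""
    for sent in sentences:
        words = [w for w, t in sent]
        gold = [t for w, t in sent]
        pred = tagger(words, s_t, s_w)
        if pred == gold:
            continue  # correct prediction: updates would cancel out
        prev_g = prev_p = "<s>"
        for i in range(len(words)):
            s_t[prev_g, gold[i]] += 1   # reward the gold transitions/emissions
            s_w[gold[i], words[i]] += 1
            s_t[prev_p, pred[i]] -= 1   # penalize the predicted ones
            s_w[pred[i], words[i]] -= 1
            prev_g, prev_p = gold[i], pred[i]
        s_t[gold[-1], "</s>"] += 1
        s_t[pred[-1], "</s>"] -= 1

# demonstration with a stand-in tagger that always predicts O
s_t = defaultdict(int)
s_w = defaultdict(int)
def always_O(words, s_t, s_w):
    return ["O"] * len(words)

perceptron_epoch([[("Justin", "B-person"), ("sang", "O")]], s_t, s_w, always_O)
```

After this one update, the weight for emitting "Justin" as B-person has gone up and the weight for emitting it as O has gone down, while weights the two taggings share are unchanged.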
  3. Train the perceptron on the training data. After each epoch (pass through the training data), report your accuracy (number of correct tags divided by total number of tags) on both the training data and the development data. When done training, the FB1 score (not accuracy) on the development data should reach at least 15%. Report any other details, such as:
    • How did you decide when to stop?
    • Did you use any tricks like randomly shuffling the training data, or averaging the weights?

3. Features

  1. Now try adding more features to the model. As long as each feature is a function either of $t_{i-1}$ and $t_i$ or of $t_i$ and $w_i$, you should be able to modify your Viterbi algorithm to compute it. You're also welcome to try other modifications besides adding features. In your report, describe each thing you tried and what effect it had on training and development accuracy.
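For instance, emission-side features of $(t_i, w_i)$ could be sketched like this; the particular features (lowercased word, capitalization, @-mentions, hashtags, suffixes) are illustrative choices, not a prescribed set:

```python
def word_features(tag, word):
    """Return the features that fire for a (tag, word) pair."""
    feats = [("word", tag, word.lower())]
    if word[:1].isupper():
        feats.append(("cap", tag))       # capitalized word
    if word.startswith("@"):
        feats.append(("mention", tag))   # Twitter @-mention
    if word.startswith("#"):
        feats.append(("hashtag", tag))
    feats.append(("suffix3", tag, word[-3:].lower()))
    return feats

def score_word(weights, tag, word):
    """Replacement for s_w(t_i, w_i): sum the weights of the firing features."""
    return sum(weights.get(f, 0) for f in word_features(tag, word))
```

During training, the perceptron update then adds or subtracts 1 from each firing feature's weight rather than from a single $(t, w)$ cell.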
  2. Finally, run your best model on the test set and report your FB1 (not accuracy) score. A small part of your grade will depend on how you did relative to the rest of the class.

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: