CSE 40657/60657
Homework 4

Fri 2018/11/16 5pm

In this assignment you will build and improve a named entity recognizer for Tweets. This was the shared task for WNUT 2016. This is a very difficult task, with human performance in the upper 60% range and machine performance ranging from 20% to 50%.

Clone (don't fork) the HW4 repository. It contains the following files:

train Training data
dev Development data
test Test data: don't peek!
(scorer script) Compute precision/recall
Note: This data comes unfiltered from Twitter and contains a lot of vulgar language.

You may write code in any language you choose. It should build and run on student*, but if this is not possible, please discuss it with the instructor before submitting. You may reuse code from your previous assignments, especially HW2.

Whenever the instructions below say to "report" something, it should be reported in the PDF file that you submit.

1. Hidden Markov Model

  1. The training data has the following format:
    Greek   B-other
    Festival        I-other
    at      O
    St      B-facility
    Johns   I-facility
    before  O
    ASPEN   B-geo-loc
    Ew      O
    wait    O
    Blank lines mark sentence boundaries. Each nonblank line contains a word and its corresponding tag: B-type for the beginning of a named entity of type type, I-type for a word inside a named entity of type type, and O for a word outside a named entity. Write code to read in the training data. Report how many tokens, word types, and tag types there are.
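A loader for this format can be sketched as follows (a minimal sketch; the generator structure and the function name are my own choices, not required by the assignment):

```python
def read_tagged(lines):
    """Yield sentences as lists of (word, tag) pairs from the two-column
    format above; blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
        else:
            word, tag = line.split()
            sentence.append((word, tag))
    if sentence:  # in case the file doesn't end with a blank line
        yield sentence

# Counting tokens, word types, and tag types:
# with open("train") as f:
#     sentences = list(read_tagged(f))
# tokens = sum(len(s) for s in sentences)
# word_types = {w for s in sentences for w, _ in s}
# tag_types = {t for s in sentences for _, t in s}
```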
  2. Implement a Hidden Markov Model and train it on the training data. The states should simply be the set of tags found in the data. You will need to smooth $p(w \mid t)$. We found performance to be highly sensitive to smoothing; we recommend converting all unknown words to a special symbol <unk> and using add-0.1 smoothing. Report your model's tag bigram probabilities for:
    p(B-person | O) =
    p(B-person | B-person) =
    p(I-person | B-person) =
    p(B-person | I-person) =
    p(I-person | I-person) =
    p(O | I-person) =
    and your model's (smoothed) tag-word probabilities for:
    p(God | B-person) =
    p(God | O) =
    p(Justin | B-person) =
    p(Justin | O) =
    p(Lindsay | B-person) =
    p(Lindsay | O) =
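Training with these choices can be sketched like this. It is a sketch under stated assumptions: words seen at most once in training are mapped to <unk>, add-0.1 smoothing is applied to the emissions only, transitions are left unsmoothed, and all names are illustrative.

```python
from collections import Counter

ALPHA = 0.1   # add-0.1 smoothing for p(w | t), as recommended above

def train_hmm(sentences):
    """Estimate tag-bigram and smoothed tag-word probabilities.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (p_trans, p_emit, vocab).
    """
    sentences = list(sentences)
    word_counts = Counter(w for s in sentences for w, _ in s)

    def norm(w):
        # Map words seen at most once to <unk>.
        return w if word_counts[w] > 1 else "<unk>"

    tt = Counter()     # tag-bigram counts c(t, t')
    tw = Counter()     # tag-word counts c(t, w)
    tags = Counter()   # c(t), denominator for emissions
    ctx = Counter()    # c(t) as conditioning context, including <s>
    vocab = set()
    for s in sentences:
        prev = "<s>"
        for w, t in s:
            w = norm(w)
            vocab.add(w)
            tt[prev, t] += 1
            ctx[prev] += 1
            tw[t, w] += 1
            tags[t] += 1
            prev = t
        tt[prev, "</s>"] += 1
        ctx[prev] += 1
    V = len(vocab)

    def p_trans(t, t2):
        return tt[t, t2] / ctx[t] if ctx[t] else 0.0

    def p_emit(t, w):
        w = w if w in vocab else "<unk>"
        return (tw[t, w] + ALPHA) / (tags[t] + ALPHA * V)

    return p_trans, p_emit, vocab
```

The probabilities asked for above can then be read off as, e.g., `p_trans("O", "B-person")` and `p_emit("B-person", "Justin")`.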
  3. Implement the Viterbi algorithm and run it on the development data. The output format should be:
    Thinking O O
    about O O
    Bowlounge B-facility B-person
    on O O
    Saturday O O
    night O O
    . O O
    sur O O
    mon O O
    where the first column contains the words, the second column contains the correct tags, and the third column contains your system's tags. Report what your output looks like for the first five sentences of the development set.

    Added 2018/11/14: Here are two strategies you can take for doing this.

    • Use existing code from HW2. As before, the model can be thought of as the composition of two FSTs: a bigram tag model exactly as in HW2, and a tag-to-word model, which is like the modern-to-Shakespeare model in HW2, but with all $\varepsilon$-transitions deleted. For each sentence, compose these two FSTs with an FST that accepts just that sentence, and the result should look similar to Figure 5.1 in the notes (page 27).
    • Directly implement the Viterbi algorithm for HMMs, without using the FSTs at all. Since the HMM has a simple structure, the algorithm is also not too complicated:
      # Assume w[1], ..., w[n] is the input string and w[n+1] = </s>
      # Assume that we have a model with 
      # - tag bigram probabilities P(t'|t)
      # - tag-word probabilities P(w|t), with P(</s>|</s>) = 1
      # Initialize table of Viterbi probabilities
      # viterbi[i,t] is the probability of the best tagging of words 1...i in which word i is tagged t
      viterbi[i,t] = 0 for i = 0...n+1 and all tags t
      viterbi[0,<s>] = 1
      for i = 1, ..., n+1
        for each tag t'
          for each tag t
            if viterbi[i-1,t] * P(t'|t) * P(w[i]|t') > viterbi[i,t']:
              viterbi[i,t'] = viterbi[i-1,t] * P(t'|t) * P(w[i]|t')
              pointer[i,t'] = (i-1,t)
      return viterbi[n+1,</s>]
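If you take the second route, the pseudocode translates fairly directly into Python. The sketch below works in log space to avoid underflow and adds backpointer reconstruction so that the best tag sequence itself is returned; `p_trans` and `p_emit` are assumed to be callables giving the model probabilities P(t'|t) and P(w|t).

```python
import math

def viterbi(words, tags, p_trans, p_emit):
    """Return the highest-probability tag sequence for words under an HMM."""
    n = len(words)
    w = [None] + list(words)            # 1-based indexing, as in the pseudocode
    col = {"<s>": 0.0}                  # log Viterbi scores at position 0
    back = {}
    for i in range(1, n + 2):
        targets = tags if i <= n else ["</s>"]
        newcol = {}
        for t2 in targets:
            e = p_emit(t2, w[i]) if i <= n else 1.0   # P(</s> | </s>) = 1
            if e == 0.0:
                continue
            for t, score in col.items():
                p = p_trans(t, t2)
                if p == 0.0:
                    continue
                s = score + math.log(p * e)
                if t2 not in newcol or s > newcol[t2]:
                    newcol[t2] = s
                    back[i, t2] = t
        col = newcol
    # Follow backpointers from (n+1, </s>) to recover the tag sequence.
    path, t = [], "</s>"
    for i in range(n + 1, 0, -1):
        t = back[i, t]
        path.append(t)
    path.pop()                          # drop the initial <s>
    return list(reversed(path))
```

(If no tag sequence has nonzero probability, `back[n+1, "</s>"]` will be missing; with smoothed emissions this shouldn't happen in practice.)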
  4. Run the scorer like this:
    perl < dev.out
    where dev.out is replaced with the name of your output file. You should see something like this:
    processed 16261 tokens with 661 phrases; found: 931 phrases; correct: 77.
    accuracy:  71.99%; precision:   8.27%; recall:  11.65%; FB1:   9.67
              company: precision:  85.71%; recall:  15.38%; FB1:  26.09  7
             facility: precision:  17.65%; recall:   7.89%; FB1:  10.91  17
              geo-loc: precision:  60.00%; recall:  15.52%; FB1:  24.66  30
                movie: precision:   0.00%; recall:   0.00%; FB1:   0.00  5
          musicartist: precision:   0.00%; recall:   0.00%; FB1:   0.00  67
                other: precision:   2.63%; recall:   8.33%; FB1:   3.99  419
               person: precision:  10.05%; recall:  21.64%; FB1:  13.73  368
              product: precision:   7.14%; recall:   2.70%; FB1:   3.92  14
           sportsteam: precision:  33.33%; recall:   1.43%; FB1:   2.74  3
               tvshow: precision:   0.00%; recall:   0.00%; FB1:   0.00  1
    The important score is FB1 (balanced F-measure) on all types, which is 9.67%. Report your system's score on the development data. It should be at least 9%.

2. Structured Perceptron

In this part, you'll implement a model related to a conditional random field. The model is not a probability model, just a scoring function: $$s(\mathbf{w}, \mathbf{t}) = \sum_{i=1}^n \left(\lambda(t_{i-1}, t_i) + \mu(t_i, w_i)\right) + \lambda(t_n, \texttt{</s>}).$$
  1. Update your implementation of the Viterbi algorithm to find the $\mathbf{t}$ that maximizes the above score. If you're already using log-probabilities, then most likely, no changes at all are needed.
  2. Implement the training algorithm, which works like this:
    • initialize all the $\lambda(t, t')$ and $\mu(t, w)$ to zero
    • for each training sentence $\mathbf{w}, \mathbf{t}$:
      • find the tag sequence $\mathbf{p} = p_1 \cdots p_n$ that maximizes $s(\mathbf{w}, \mathbf{p})$
      • for $i \leftarrow 1, \ldots, n$:
        • $\lambda(t_{i-1}, t_i) \mathrel{\mathord+\mathord=} 1$
        • $\mu(t_i, w_i) \mathrel{\mathord+\mathord=} 1$
        • $\lambda(p_{i-1}, p_i) \mathrel{\mathord-\mathord=} 1$
        • $\mu(p_i, w_i) \mathrel{\mathord-\mathord=} 1$
      • $\lambda(t_n, \texttt{</s>}) \mathrel{\mathord+\mathord=} 1$
      • $\lambda(p_n, \texttt{</s>}) \mathrel{\mathord-\mathord=} 1$
    The intuition is: If we got the sentence right, then the updates cancel each other out. If we got the sentence wrong, then the updates increase the weights we would have used to get it right and decrease the weights we did use to get it wrong.
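One pass of this update rule can be sketched as follows. Here `viterbi_best` is a placeholder for your Part 1 decoder modified to maximize $s(\mathbf{w}, \mathbf{t})$, and the weight tables are plain dictionaries; none of these names are part of the assignment code.

```python
from collections import defaultdict  # weights default to 0, as required

def perceptron_epoch(data, lam, mu, viterbi_best):
    """One pass of structured-perceptron training.

    data: list of (words, gold_tags) pairs.
    lam, mu: defaultdict(float) weight tables for tag bigrams and tag-words.
    viterbi_best(words, lam, mu): returns the highest-scoring tag sequence.
    """
    for words, gold in data:
        pred = viterbi_best(words, lam, mu)
        if pred == gold:
            continue                      # updates would cancel out anyway
        prev_g = prev_p = "<s>"
        for w, g, p in zip(words, gold, pred):
            lam[prev_g, g] += 1           # reward the correct path
            mu[g, w] += 1
            lam[prev_p, p] -= 1           # penalize the predicted path
            mu[p, w] -= 1
            prev_g, prev_p = g, p
        lam[prev_g, "</s>"] += 1
        lam[prev_p, "</s>"] -= 1
```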

    Added 2018/11/14: When I wrote the initial solution, I implemented it from scratch, not using the FST library; unfortunately, the FST library's composition is slow, so if you're using it, here's the trick to get it to train fast (about 1 minute per pass through the training data).

    1. Do a git pull on the HW2 repository to get a new version of the FST module. The new version has a function compose_right that works just like compose, but assumes that the FST on the right is smaller, and optimizes accordingly. Use this function instead.
    2. Many of you were already doing this in HW2, but you only need to compose the tag bigram and tag-to-word models once; you don't need to recompose for every single sentence.
    3. However, during training, you're updating the model parameters after every sentence, and it would be expensive to reweight every single transition. Instead, modify your Viterbi algorithm to ignore the weights stored in the FST object, and compute them on the fly: if trans is a transition, then we have that trans.q == ((t, _), i) and trans.a == (t', w) and trans.r == ((t', _), i+1). You can use those to compute the transition's weight, which is $\lambda(t,t') + \mu(t',w)$.
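Using the transition fields named above, the on-the-fly weight computation is a one-liner (a sketch; lam and mu are assumed to be your perceptron weight dictionaries):

```python
def transition_weight(trans, lam, mu):
    """Compute lambda(t, t') + mu(t', w) from an FST transition on the fly,
    ignoring any weight stored in the FST itself."""
    ((t, _), i) = trans.q      # source state: (tag, ...) at position i
    (t2, w) = trans.a          # transition label: (next tag, word)
    return lam.get((t, t2), 0.0) + mu.get((t2, w), 0.0)
```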
  3. Train the perceptron on the training data. After each epoch (pass through the training data), report your accuracy (number of correct tags divided by total number of tags) on both the training data and the development data. When done training, the FB1 (not accuracy) score on the development data should reach at least 14%. Report any other details, like:
    • How did you decide when to stop?
    • Did you use any tricks like randomly shuffling the training data, or averaging the weights?

3. Features

  1. Now try adding more features to the model. As long as each feature is a function either of $t_{i-1}$ and $t_i$ or of $t_i$ and $w_i$, you should be able to modify your Viterbi algorithm to compute it. You're also welcome to try other modifications besides adding features. In your report, describe each thing you tried, and what its effect on training and development accuracy was.
  2. Finally, run your best model on the test set and report your FB1 (not accuracy) score. A small part of your grade will depend on how you did relative to the rest of the class.
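For item 1, features of the allowed tag-word form can be generated by a function of $t_i$ and $w_i$ alone; the sketch below shows a few Twitter-flavored examples. All feature names are made up, and none of these are required — their weights would be updated exactly like $\mu(t, w)$ in Part 2.

```python
def word_features(t, w):
    """Return feature keys that fire for tag t on word w (illustrative only)."""
    feats = [("word", t, w.lower())]
    if w.istitle():
        feats.append(("titlecase", t))      # capitalized words often start names
    if w.startswith("@"):
        feats.append(("mention", t))        # Twitter @-mentions
    if w.startswith("#"):
        feats.append(("hashtag", t))
    if len(w) >= 3:
        feats.append(("suffix3", t, w[-3:].lower()))
    if any(c.isdigit() for c in w):
        feats.append(("has-digit", t))
    return feats
```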


Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: