CSE 40657/60657
Homework 4

Fri 2019/11/22 5pm

In this assignment you will build and improve a named entity recognizer using a hidden Markov Model and a structured perceptron.

You may write code in any language you choose. It should build and run on student*, but if this is not possible, please discuss it with the instructor before submitting. You may reuse code from your previous assignments, especially HW2.

Whenever the instructions below say to "report" something, it should be reported in the PDF file that you submit.

1. Hidden Markov Model

  1. Download the HW4 package. It contains the following files:
    train-1k Training data (1k sentences)
    train-10k Training data (10k sentences)
    train-all Training data (all sentences)
    dev Development data
    test Test data: don't peek! Compute precision/recall
    The data is a subset of the OntoNotes corpus. The training stories selected come from Sinorama, a Taiwanese magazine, while the dev and test stories come from blogs. All files have the following format:
    With      O
    a         O
    throw     O
    of        O
    57.28     B-QUANTITY
    meters    I-QUANTITY
    ,         O
    Chiang    B-PERSON
    shattered O
    the       O
    world     O
    record    O
    for       O
    the       O
    javelin   O
    in        O
    the       O
    F12       B-CARDINAL
    visually  O
    -         O
    impaired  O
    class     O
    .         O
    Chiang    B-PERSON
    Blank lines mark sentence boundaries. Each nonblank line contains a word and its corresponding tag: B-type for the beginning of a named entity of type type, I-type for a word inside a named entity of type type, and O for a word outside a named entity. Write code to read in train-1k. Report how many tokens, word types, and tag types there are.1
  2. Train a Hidden Markov Model on train-1k.5
    • If you want to, you can start with your FST from HW2. As before, the model can be thought of as the composition of two FSTs: a bigram tag model exactly as in HW2, and a tag-to-word model, which is like the modern-to-Shakespeare model in HW2, but with all $\varepsilon$-transitions deleted. For each sentence, compose these two FSTs with an FST that accepts just that sentence.
    • Or, you can start from scratch without using FSTs.
    It's important to smooth $p(w \mid t)$; a simple recommendation is to use add-0.1 smoothing, and in the dev/test data, convert all unknown words to a zero-count symbol <unk>. Report2 your model's tag bigram probabilities for:
    p(B-DATE | B-DATE) =
    p(I-DATE | B-DATE) =
         p(O | B-DATE) =
    p(B-DATE | I-DATE) =
    p(I-DATE | I-DATE) =
         p(O | I-DATE) =
         p(B-DATE | O) =
         p(I-DATE | O) =
              p(O | O) =
    and your model's (smoothed) tag-word probabilities for:
         p(one | B-DATE) =
         p(one | I-DATE) =
              p(one | O) =
    p(February | B-DATE) =
    p(February | I-DATE) =
         p(February | O) =

2. Decoding

  1. Implement the Viterbi algorithm (or adapt your implementation from HW2) to work on the present model.5
    • If you started with your FST from HW2, you probably need to make minimal or no modifications to your HW2 code. If you weren't using log-probabilities before, it would be good to start.
    • If you started from scratch, you'll need to reimplement the Viterbi algorithm. Since the HMM has a simple structure, the algorithm is also not too complicated:
      # Assume w[1], ..., w[n-1] is the input string and w[n] = </s>
      # Assume that we have a model with 
      # - tag bigram probabilities P(t'|t)
      # - tag-word probabilities P(w|t), with P(</s>|</s>) = 1
      # Initialize table of Viterbi probabilities
      # viterbi[i,t] is the probability of the best way of tagging words 1...i-1 with some tags and tagging word i as t
      viterbi[i,t] = 0 for i = 0...n+1 and all tags t
      viterbi[0,<s>] = 1
      for i = 1, ..., n
        for each tag t'
          for each tag t
            if viterbi[i-1,t] * P(t'|t) * P(w[i]|t') > viterbi[i,t']:
              viterbi[i,t'] = viterbi[i-1,t] * P(t'|t) * P(w[i]|t')
              pointer[i,t'] = (i-1,t)
      return viterbi[n,</s>]
  2. Tag the development data.1 The output format should be:
    On O O
    the O B-DATE
    attended O O
    the O O
    inauguration O O
    of O O
    President O B-PERSON
    Hipolito B-PERSON I-PERSON
    . O O
    The O O
    where the first column contains the words, the second column contains the correct tags, and the third column contains your system's tags. The conlleval script is a very finicky Perl script; please make sure that the columns are separated by exactly one space character, and that there is no leading or trailing whitespace, and that blank lines do not have any whitespace. Report what your output looks like for the first five sentences of the development set.1
  3. Run the scorer like this:
    perl < dev.out
    where dev.out is replaced with the name of your output file. You should see something like this:
    processed 21608 tokens with 1635 phrases; found: 1652 phrases; correct: 472.
    accuracy:  78.86%; precision:  28.57%; recall:  28.87%; FB1:  28.72
             CARDINAL: precision:  48.48%; recall:  13.91%; FB1:  21.62  33
                 DATE: precision:  20.57%; recall:  28.86%; FB1:  24.02  209
                EVENT: precision:   0.00%; recall:   0.00%; FB1:   0.00  369
                  FAC: precision:   0.00%; recall:   0.00%; FB1:   0.00  1
                  GPE: precision:  87.23%; recall:  39.27%; FB1:  54.16  235
             LANGUAGE: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
                  LAW: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
                  LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  18
                MONEY: precision:  28.57%; recall:  50.00%; FB1:  36.36  14
                 NORP: precision:  93.90%; recall:  45.83%; FB1:  61.60  82
              ORDINAL: precision:  66.67%; recall:  14.63%; FB1:  24.00  9
                  ORG: precision:   9.39%; recall:  12.50%; FB1:  10.73  181
              PERCENT: precision:  50.00%; recall:  25.00%; FB1:  33.33  4
               PERSON: precision:  25.20%; recall:  35.16%; FB1:  29.36  381
             QUANTITY: precision:  60.00%; recall:  25.00%; FB1:  35.29  10
                 TIME: precision:   0.00%; recall:   0.00%; FB1:   0.00  4
          WORK_OF_ART: precision:   0.00%; recall:   0.00%; FB1:   0.00  102
    The important score is FB1 (balanced F-measure) on all types, which is 46.15%. Report your system's score on the development data.1 It should be at least 28%.2

3. Structured Perceptron

In this part, you'll implement a model related to a conditional random field. The model is not a probability model, just a scoring function: $$s(\mathbf{w}, \mathbf{t}) = \sum_{i=1}^n \left(\lambda_1(t_{i-1}, t_i) + \lambda_2(t_i, w_i)\right),$$ where $t_0 = \texttt{<s>}$ and $t_n = w_n = \texttt{</s>}$.
  1. Update your implementation of the Viterbi algorithm to find the $\mathbf{t}$ that maximizes the above score.1 If you're already using log-probabilities, then most likely, no changes at all are needed.
  2. Implement the training algorithm,5 which works like this:
    • initialize all the $\lambda(t, t')$ and $\mu(t, w)$ to zero
    • for each training sentence $\mathbf{w}, \mathbf{t}$:
      • find the tag sequence $\mathbf{p} = p_1 \cdots p_n$ that maximizes $s(\mathbf{w}, \mathbf{p})$
      • for $i \leftarrow 1, \ldots, n$:
        • $\lambda_1(t_{i-1}, t_i) \mathrel{\mathord+\mathord=} 1$
        • $\lambda_2(t_i, w_i) \mathrel{\mathord+\mathord=} 1$
        • $\lambda_1(p_{i-1}, p_i) \mathrel{\mathord-\mathord=} 1$
        • $\lambda_2(p_i, w_i) \mathrel{\mathord-\mathord=} 1$
      • [There used to be two more updates here for the end of sentence, but they are redundant if we are treating $t_n$ as $\texttt{</s>}$]
    The intuition is: If we got the sentence right, then the updates cancel each other out. If we got the sentence wrong, then the updates increase the weights we would have used to get it right and decrease the weights we did use to get it wrong.
  3. Train the perceptron on the training data. After each epoch (pass through the training data), report your accuracy (number of correct tags divided by total number of tags) on both the training data and the development data.1 When done training, the FB1 (not accuracy) score on the development data should reach at least 40%.2 Report any other details, like:1
    • How did you decide when to stop?
    • Did you use any tricks like randomly shuffling the training data, or averaging the weights?
  4. Report2 your model's tag bigram weights for:
    λ₁(B-DATE, B-DATE) =
    λ₁(B-DATE, I-DATE) =
         λ₁(B-DATE, O) =
    λ₁(I-DATE, B-DATE) =
    λ₁(I-DATE, I-DATE) =
         λ₁(I-DATE, O) =
         λ₁(O, B-DATE) =
         λ₁(O, I-DATE) =
              λ₁(O, O) =
    and your model's tag-word weights for:
         λ₂(B-DATE, one) =
         λ₂(I-DATE, one) =
              λ₂(O, one) =
    λ₂(B-DATE, February) =
    λ₂(I-DATE, February) =
         λ₂(O, February) =


Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: