CSE 40/60657: Natural Language Processing

Homework 2

Due: 2014/02/25 11:55 pm
Points: 30

In this assignment, you will build a model to remove disfluencies from transcribed English speech. For example, given a sentence like:

uh , first , um , i need to know , uh , how do you feel about , uh , about sending , uh , an elderly , uh , family member to a nursing home ?
the goal is to remove the filler words, restarts, etc., to obtain the sentence:
first , i need to know , how do you feel about sending an elderly family member to a nursing home ?

Download the Homework 2 data from Sakai. It contains the following files, from the Penn Treebank (Switchboard portion):
train.txt Training data
test.txt Testing data
Both files contain lines that look like this:

uh/F ,/F first/N ,/N um/F ,/F i/N need/N to/N know/N ,/N uh/F ,/F how/N do/N you/N feel/N about/R ,/R uh/F ,/F about/N sending/N ,/N uh/F ,/F an/N elderly/N ,/N uh/F ,/F family/N member/N to/N a/N nursing/N home/N ?/N
Each whitespace-separated token has a word and a label, separated by a slash. There are six possible labels:
N Normal word
F Filled pause (uh, um)
E Explicit editing term (I mean)
D Discourse marker (you know)
A Aside
R Word that should be deleted because of a restart
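
For concreteness, one way to read this format into (word, label) pairs is sketched below (Python; the function name is only illustrative, and rsplit guards against a stray slash inside a word):

    def read_tagged(path):
        """Yield each line of train.txt or test.txt as a list of (word, label) pairs."""
        with open(path) as f:
            for line in f:
                yield [tuple(token.rsplit('/', 1)) for token in line.split()]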

We'll measure accuracy as the percentage of words that are assigned the correct label.
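
In other words, if the gold and predicted sequences are both read in the format above, accuracy can be computed as follows (a sketch; the names are only illustrative):

    def accuracy(gold_sentences, predicted_sentences):
        """Fraction of words whose predicted label matches the gold label."""
        correct = total = 0
        for gold, pred in zip(gold_sentences, predicted_sentences):
            for (_, g), (_, p) in zip(gold, pred):
                correct += (g == p)
                total += 1
        return correct / total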

You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting.

In the following, point values are written after each requirement, like this.30

  1. First, write a baseline system that simply guesses the most frequent label for each word; a minimal counting sketch appears after this list. This is equivalent to a unigram (0th-order) hidden Markov model.
    1. Describe any choices you had to make (particularly regarding unknown words).1
    2. Train the model on the training data. For each label $t$, what is the probability of the word "you" being labeled $t$, that is, $P(t \mid \text{you})$?2 (Note: These should sum to one!)
    3. Test the model on the testing data. Report your accuracy.3 It should be pretty good, at least 85%.3
    4. Show the output of the baseline system on the second line of the testing data (starts with: huh , well , um).1 Try deleting all words not labeled N and see how it looks.
  2. Next, implement a bigram (1st-order) hidden Markov model. You will need to implement both the trainer (count and divide) and the decoder (the Viterbi algorithm); a skeleton sketch appears after this list. You should imagine that there are fake labels $\texttt{<s>}$ and $\texttt{</s>}$ before the first word and after the last word, respectively.
    1. Describe any choices you had to make (particularly regarding unknown words and/or smoothing).1
    2. Train the model on the training data. Print out the whole (7x7 or 8x8) matrix of probabilities $P(t' \mid t)$. For each label $t$, what is $P(\text{you} \mid t)$?2
    3. Test the model on the testing data. Report your accuracy.3 It should be at least 88%.3
    4. Show the output of the HMM on the second line of the testing data and report the model (log-)probability for this output.1 Try deleting all words not labeled N and see how it looks.
  3. Finally, try extending the model to improve its accuracy further. Some possible ideas:
    • Trigrams or beyond.
    • Refine the label set to encode extra information that you think might be useful. Add this information to the training data but not to the test data; then, after decoding the test data with the refined labels, strip the extra information and compare against the correct labels.
    • Modify the words to reflect information that you think might be useful.
    Do the following:
    1. Clearly describe what extension you tried.1
    2. Report your accuracy,3 which must be higher than that of your basic bigram HMM.3 See if you can get the accuracy above 90%.1
    3. Show the output of the HMM on the second line of the testing data.1 Try deleting all words not labeled N and see how it looks.
    4. Briefly write down your conclusions from this experiment.1
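
For reference, here are rough Python sketches of what parts 1 and 2 might look like; your own implementation may of course differ. For part 1, the baseline just counts how often each word receives each label and picks the most frequent one. Guessing N for unknown words is only one possible fallback, and whatever you choose should go in your write-up:

    from collections import Counter, defaultdict

    def train_baseline(sentences):
        """Count how often each word is assigned each label in the training data."""
        counts = defaultdict(Counter)
        for sentence in sentences:
            for word, label in sentence:
                counts[word][label] += 1
        return counts

    def tag_baseline(counts, words):
        """Label each word with its most frequent training label; guess N for unknown words."""
        return [counts[w].most_common(1)[0][0] if w in counts else 'N' for w in words]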
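
For part 2, one possible skeleton for the count-and-divide trainer and the Viterbi decoder is sketched below. The add-one smoothing and the <unk> treatment of unknown words are assumptions of this sketch, not requirements of the assignment; they are simply one way to avoid zero probabilities:

    import math
    from collections import Counter, defaultdict

    BOS, EOS, UNK = '<s>', '</s>', '<unk>'

    def train_hmm(sentences):
        """Count-and-divide estimates of P(t'|t) and P(w|t), add-one smoothed, stored as logs."""
        trans = defaultdict(Counter)   # trans[t][t'] = count of the label bigram (t, t')
        emit = defaultdict(Counter)    # emit[t][w]   = count of word w labeled t
        vocab = {UNK}
        for sentence in sentences:
            prev = BOS
            for word, label in sentence:
                trans[prev][label] += 1
                emit[label][word] += 1
                vocab.add(word)
                prev = label
            trans[prev][EOS] += 1
        labels = sorted(emit)
        log_trans, log_emit = {}, {}
        for t in labels + [BOS]:
            total = sum(trans[t].values()) + len(labels) + 1
            log_trans[t] = {u: math.log((trans[t][u] + 1) / total) for u in labels + [EOS]}
        for t in labels:
            total = sum(emit[t].values()) + len(vocab)
            log_emit[t] = {w: math.log((emit[t][w] + 1) / total) for w in vocab}
        return labels, log_trans, log_emit

    def viterbi(labels, log_trans, log_emit, words):
        """Return the best label sequence for a sentence and its log-probability."""
        words = [w if w in log_emit[labels[0]] else UNK for w in words]
        best = [{BOS: 0.0}]   # best[i][t] = score of the best labeling of words[:i] ending in t
        back = [{}]           # back[i][t] = the previous label on that best path
        for i, w in enumerate(words, 1):
            best.append({})
            back.append({})
            for t in labels:
                score, prev = max((best[i - 1][p] + log_trans[p][t] + log_emit[t][w], p)
                                  for p in best[i - 1])
                best[i][t] = score
                back[i][t] = prev
        score, last = max((best[-1][t] + log_trans[t][EOS], t) for t in best[-1])
        path = [last]
        for i in range(len(words), 1, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path)), score

Keeping the scores in log space avoids underflow on long sentences, and the log-probability asked for in part 2.4 falls directly out of the decoder.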

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: