CSE 40/60657: Natural Language Processing
Homework 2
- Due: 2014/02/25 11:55 pm
- Points: 30
In this assignment, you will build a model to remove disfluencies from transcribed English speech. For example, given a sentence like:
uh , first , um , i need to know , uh , how do you feel about , uh , about sending , uh , an elderly , uh , family member to a nursing home ?
the goal is to remove the filler words, restarts, etc., to obtain the sentence:
first , i need to know , how do you feel about sending an elderly family member to a nursing home ?
Download the Homework 2 data from Sakai. It contains the following files, from the Penn Treebank (Switchboard portion):
train.txt | Training data
test.txt | Testing data

Each line of these files contains one labeled sentence. For example, the sentence shown above appears in the data as:

uh/F ,/F first/N ,/N um/F ,/F i/N need/N to/N know/N ,/N uh/F ,/F how/N do/N you/N feel/N about/R ,/R uh/F ,/F about/N sending/N ,/N uh/F ,/F an/N elderly/N ,/N uh/F ,/F family/N member/N to/N a/N nursing/N home/N ?/N
Each whitespace-separated token has a word and a label, separated by a slash. There are six possible labels:
N | Normal word
F | Filled pause (uh, um)
E | Explicit editing term (I mean)
D | Discourse marker (you know)
A | Aside
R | Word that should be deleted because of a restart
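The word/label format above is easy to parse in, for example, Python (one possible choice of language; the function name here is ours, not part of the assignment):

```python
def read_tagged(lines):
    """Parse lines of word/label tokens into lists of (word, label) pairs."""
    sentences = []
    for line in lines:
        pairs = []
        for token in line.split():
            # rsplit guards against a word that itself contains a slash
            word, label = token.rsplit("/", 1)
            pairs.append((word, label))
        sentences.append(pairs)
    return sentences

example = "uh/F ,/F first/N i/N need/N to/N know/N"
print(read_tagged([example])[0][:3])  # [('uh', 'F'), (',', 'F'), ('first', 'N')]
```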
We'll measure accuracy as the percentage of words that are assigned the correct label.
You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting.
In the following, point values are written after each requirement, like this.30
- First, write a baseline system that just guesses the most frequent label for each word. This is equivalent to a unigram (0th-order) hidden Markov model.
- Describe any choices you had to make (particularly regarding unknown words).1
- Train the model on the training data. For each label $t$, what is the probability of the word "you" being labeled $t$, that is, $P(t \mid \text{you})$?2 (Note: These should sum to one!)
- Test the model on the testing data. Report your accuracy.3 It should be pretty good, at least 85%.3
- Show the output of the baseline system on the second line of the testing data (starts with: huh , well , um).1 Try deleting all words not labeled N and see how it looks.
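A minimal sketch of the baseline in Python (the fallback of N for unknown words is just one reasonable choice; the assignment asks you to describe whatever choice you make):

```python
from collections import Counter, defaultdict

def train_baseline(sentences):
    """For each word, count how often it receives each label,
    then tag every occurrence with its most frequent label."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, label in sentence:
            counts[word][label] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    # Unknown words default to N, the most common label overall
    # (an assumption; other fallbacks are possible).
    return lambda word: best.get(word, "N")

tag = train_baseline([[("uh", "F"), ("i", "N"), ("uh", "F")]])
print(tag("uh"), tag("never-seen"))  # F N
```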
- Next, implement a bigram (1st-order) hidden Markov model. You will need to implement both the trainer (count and divide) and the decoder (Viterbi algorithm). You should imagine that there are fake labels $\verb|<s>|$ and $\verb|</s>|$ before the first word and after the last word, respectively.
- Describe any choices you had to make (particularly regarding unknown words and/or smoothing).1
- Train the model on the training data. Print out the whole (7x7 or 8x8) matrix of probabilities $P(t' \mid t)$. For each label $t$, what is $P(\text{you} \mid t)$?2
- Test the model on the testing data. Report your accuracy.3 It should be at least 88%.3
- Show the output of the HMM on the second line of the testing data and report the model (log-)probability for this output.1 Try deleting all words not labeled N and see how it looks.
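The two pieces, a count-and-divide trainer and a Viterbi decoder, can be sketched as follows (a sketch under stated assumptions: add-one smoothing is only one of several reasonable choices, and the function names are ours):

```python
import math
from collections import Counter, defaultdict

def train_hmm(sentences, labels):
    """Count-and-divide estimates with add-one smoothing (one simple choice)."""
    trans = defaultdict(Counter)   # trans[t][t2]: label t followed by t2
    emit = defaultdict(Counter)    # emit[t][w]: label t emits word w
    vocab = set()
    for sentence in sentences:
        prev = "<s>"
        for word, label in sentence:
            trans[prev][label] += 1
            emit[label][word] += 1
            vocab.add(word)
            prev = label
        trans[prev]["</s>"] += 1
    V = len(vocab) + 1  # +1 reserves mass for unknown words

    def log_p_trans(t, t2):
        c = trans[t]
        return math.log((c[t2] + 1) / (sum(c.values()) + len(labels) + 1))

    def log_p_emit(t, w):
        c = emit[t]
        return math.log((c[w] + 1) / (sum(c.values()) + V))

    return log_p_trans, log_p_emit

def viterbi(words, labels, log_p_trans, log_p_emit):
    """Find the highest log-probability label sequence for words."""
    # chart[i][t] = (best log-prob of words[:i+1] ending in label t, backpointer)
    chart = [{t: (log_p_trans("<s>", t) + log_p_emit(t, words[0]), None)
              for t in labels}]
    for i in range(1, len(words)):
        row = {}
        for t in labels:
            best = max((chart[i - 1][tp][0] + log_p_trans(tp, t), tp)
                       for tp in labels)
            row[t] = (best[0] + log_p_emit(t, words[i]), best[1])
        chart.append(row)
    # Finish with </s>, then follow backpointers to recover the path.
    score, t = max((chart[-1][t][0] + log_p_trans(t, "</s>"), t) for t in labels)
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = chart[i][t][1]
        path.append(t)
    return list(reversed(path)), score

sents = [[("uh", "F"), ("i", "N")]]
labels = ["N", "F"]
lp_t, lp_e = train_hmm(sents, labels)
path, logp = viterbi(["uh", "i"], labels, lp_t, lp_e)
print(path)  # ['F', 'N']
```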
- Finally, try extending the model to improve its accuracy further. Some possible ideas:
- Trigrams or beyond.
- Refine the set of labels to include information that you think might be useful. Add this information to the training data but not to the test data; instead, after you test on the test data, remove the extra information, then compare against the correct labels.
- Modify the words to reflect information that you think might be useful.
- Clearly describe what extension you tried.1
- Report your accuracy,3 which must be higher than your basic bigram HMM.3 See if you can get the accuracy above 90%.1
- Show the output of the HMM on the second line of the testing data.1 Try deleting all words not labeled N and see how it looks.
- Briefly write down your conclusions from this experiment.1
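As one illustration of the label-refinement idea, here is a hypothetical refinement (our own invention, not part of the assignment) that splits N into N and Np for punctuation, together with the mapping that removes the extra information again before comparing against the correct labels:

```python
def refine(word, label):
    """Hypothetical refinement: give punctuation its own version of N."""
    if label == "N" and not word.isalnum():
        return "Np"
    return label

def unrefine(label):
    """Map refined labels back to the original label set before scoring."""
    return "N" if label == "Np" else label

print(refine(",", "N"), unrefine("Np"))  # Np N
```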
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
- A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
- All of the code that you wrote.
- A README file with instructions on how to build and run your code on student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.