Download the Homework 3 data from Sakai, in the Resources section (hw3-data.tgz, hw3-data.zip). It contains the following files, from the Penn Treebank (Switchboard portion):
| File | Contents |
|-----------|---------------|
| train.txt | Training data |
| test.txt | Testing data |
Both files contain lines that look like this:
uh/F ,/F first/N ,/N um/F ,/F i/N need/N to/N know/N ,/N uh/F ,/F how/N do/N you/N feel/N about/R ,/R uh/F ,/F about/N sending/N ,/N uh/F ,/F an/N elderly/N ,/N uh/F ,/F family/N member/N to/N a/N nursing/N home/N ?/N
Each whitespace-separated token has a word and a label, separated by a slash. There are six possible labels:
| Label | Meaning |
|-------|---------|
| N | Normal word |
| F | Filled pause (uh, um) |
| E | Explicit editing term (I mean) |
| D | Discourse marker (you know) |
| A | Aside |
| R | Word that should be deleted because of a restart |
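One way to load this format is to split each token on its last slash, so that words containing a slash are still handled correctly. This is just a sketch; the function names are my own, not part of the assignment:

```python
def parse_line(line):
    """Turn one line of word/label tokens into a list of (word, label) pairs."""
    pairs = []
    for token in line.split():
        # Split on the *last* slash, in case a word itself contains one.
        word, _, label = token.rpartition("/")
        pairs.append((word, label))
    return pairs

def read_tagged(path):
    """Yield each line of a data file as a list of (word, label) pairs."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield parse_line(line)
```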
We'll measure accuracy as the percentage of words that are assigned the correct label.
You may write code in any language you choose. Please include in your README any instructions needed to get your code running.
Wherever the instructions below ask a question or say "describe" or "report" or "show", please include your responses in your PDF report.
In the following, point values are written as a superscript number after each requirement, like this.30
- First, write a baseline system that just guesses the most frequent label for each word. This is equivalent to a unigram (0th-order) hidden Markov model.
- Describe any choices you had to make (particularly regarding unknown words).1
- Train the model on the training data. For each label $t$, what is the probability of the word "you" being labeled $t$, that is, $P(t \mid \text{you})$?2 (Note: These should sum to one!)
- Test the model on the testing data. Report your accuracy.3 It should be pretty good, at least 85%.3
- Show the output of the baseline system on the second line of the testing data (starts with `huh , well , um`).1 Try deleting all words not labeled N and see how it looks.
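The baseline above can be sketched in a few lines, assuming the training data has been loaded as lists of (word, label) pairs. The fallback for unknown words shown here (the globally most frequent label) is one reasonable choice, not the only one:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Most-frequent-label-per-word tagger (unigram / 0th-order HMM)."""
    counts = defaultdict(Counter)   # word -> Counter of labels
    label_totals = Counter()        # overall label frequencies
    for sent in tagged_sentences:
        for word, label in sent:
            counts[word][label] += 1
            label_totals[label] += 1
    # Unknown words fall back to the most frequent label overall (a choice).
    fallback = label_totals.most_common(1)[0][0]
    best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: best.get(word, fallback)
```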
- Next, implement a bigram (1st-order) hidden Markov model. You will need to implement both the trainer (count and divide) and the decoder (the Viterbi algorithm). You should imagine that there are fake labels $\verb|<s>|$ and $\verb|</s>|$ before the first word and after the last word, respectively.
- Describe any choices you had to make (particularly regarding unknown words and/or smoothing).1
- Train the model on the training data. Print out the whole (7x7 or 8x8) matrix of probabilities $P(t' \mid t)$. For each label $t$, what is $P(\text{you} \mid t)$?2
- Test the model on the testing data. Report your accuracy.3 It should be at least 88%.3
- Show the output of the HMM on the second line of the testing data and report the model (log-)probability for this output.1 Try deleting all words not labeled N and see how it looks.
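A sketch of the trainer and decoder, under some specific choices you are free to make differently: add-$\alpha$ smoothing on both transitions and emissions, and unseen words mapped to a shared unknown-word type. Work in log space to avoid underflow on long sentences:

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, alpha=0.1):
    """Count-and-divide estimates with add-alpha smoothing (one possible choice)."""
    trans = defaultdict(Counter)   # counts for P(t' | t)
    emit = defaultdict(Counter)    # counts for P(w | t)
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, label in sent:
            trans[prev][label] += 1
            emit[label][word] += 1
            vocab.add(word)
            prev = label
        trans[prev]["</s>"] += 1
    labels = list(emit)
    V = len(vocab) + 1  # +1 for the shared unknown-word type

    def log_p_trans(t, t2):
        c = trans[t]
        return math.log((c[t2] + alpha) /
                        (sum(c.values()) + alpha * (len(labels) + 1)))

    def log_p_emit(t, w):
        c = emit[t]
        if w not in vocab:
            w = None  # map unseen words to a shared <unk> (count 0)
        return math.log((c[w] + alpha) / (sum(c.values()) + alpha * V))

    return labels, log_p_trans, log_p_emit

def viterbi(words, labels, log_p_trans, log_p_emit):
    """Return the best label sequence and its log-probability."""
    # chart[i][t] = (best log-prob of words[:i+1] ending in t, backpointer)
    chart = [{t: (log_p_trans("<s>", t) + log_p_emit(t, words[0]), None)
              for t in labels}]
    for i in range(1, len(words)):
        row = {}
        for t in labels:
            bp = max(labels,
                     key=lambda tp: chart[i - 1][tp][0] + log_p_trans(tp, t))
            row[t] = (chart[i - 1][bp][0] + log_p_trans(bp, t)
                      + log_p_emit(t, words[i]), bp)
        chart.append(row)
    # Close off with </s>, then follow backpointers to recover the sequence.
    last = max(labels, key=lambda t: chart[-1][t][0] + log_p_trans(t, "</s>"))
    logprob = chart[-1][last][0] + log_p_trans(last, "</s>")
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(chart[i][seq[-1]][1])
    seq.reverse()
    return seq, logprob
```

The returned log-probability is exactly the quantity the "report the model (log-)probability" item asks for.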
- Finally, try extending the model to improve its accuracy further. Some possible ideas:
- Trigrams or beyond.
- A recurrent neural network.
- Refine the set of labels to include information that you think might be useful. Add this information to the training data but not to the test data; instead, after you test on the test data, remove the extra information, then compare against the correct labels.
- Modify the words to reflect information that you think might be useful.
Do the following:
- Clearly describe what extension you tried.1
- Report your accuracy,3 which must be higher than that of your basic bigram HMM.3 See if you can get the accuracy above 90%.1
- Show the output of the HMM on the second line of the testing data.1 Try deleting all words not labeled N and see how it looks.
- Briefly write down your conclusions from this experiment.1
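If you pursue the trigram idea, the transition model generalizes to conditioning on the two previous labels, and the Viterbi chart must then track pairs of labels. A hedged sketch of the training side only, using interpolation with unigram and bigram estimates to handle unseen label triples (the weights here are arbitrary placeholders, to be tuned on held-out data):

```python
import math
from collections import Counter, defaultdict

def train_trigram_transitions(tagged_sentences, lam=(0.7, 0.2, 0.1)):
    """Interpolated trigram transition probabilities over labels.

    lam weights the trigram, bigram, and unigram estimates; these values
    are placeholders and should be tuned on held-out data.
    """
    tri = defaultdict(Counter)
    bi = defaultdict(Counter)
    uni = Counter()
    for sent in tagged_sentences:
        labels = ["<s>", "<s>"] + [t for _, t in sent] + ["</s>"]
        for a, b, c in zip(labels, labels[1:], labels[2:]):
            tri[(a, b)][c] += 1
            bi[b][c] += 1
            uni[c] += 1
    total = sum(uni.values())

    def log_p(a, b, c):
        p3 = tri[(a, b)][c] / sum(tri[(a, b)].values()) if tri[(a, b)] else 0.0
        p2 = bi[b][c] / sum(bi[b].values()) if bi[b] else 0.0
        p1 = uni[c] / total
        return math.log(lam[0] * p3 + lam[1] * p2 + lam[2] * p1)

    return log_p
```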
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
- A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
- All of the code that you wrote.
- A README file with instructions on how to build and run your code.