CSE 40/60657: Natural Language Processing

Homework 4

Due: 2014/04/21 11:55 pm
Points: 30

In this assignment you will build a simple parser trained from the ATIS portion of the Penn Treebank.

Download the Homework 4 data from Sakai. It contains the following files:

train.trees Training data

dev.strings Development data

dev.trees

test.strings Test data: don't look at these files!

test.trees

preprocess.py Preprocessor

unknown.py Replace one-count words with <unk>

postprocess.py Postprocessor

evalb.py Compute labeled precision/recall

Try the following:

Run train.trees through preprocess.py and save the output to train.trees.pre. This script makes the trees strictly binary branching. When it binarizes, it inserts nodes with labels of the form X*, and when it removes unary nodes, it fuses labels so they look like X_Y.
Run train.trees.pre through postprocess.py and verify that the output is identical to the original train.trees. This script reverses all the modifications made by preprocess.py.
Run train.trees.pre through unknown.py and save the output to train.trees.pre.unk. This script replaces all words that occurred only once with the special symbol <unk>.

You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting. You are free to use any of the Python code provided when you program your own solutions to this assignment. (In particular, the module tree.py has useful code for handling trees.)

First, we will learn a probabilistic CFG from trees, and store it in the following format:
```
NP -> DT NN # 0.5
NP -> DT NNS # 0.5
DT -> the # 1.0
NN -> boy # 0.5
NN -> girl # 0.5
NNS -> boys # 0.5
NNS -> girls # 0.5
```
1. Write code to read in trees and to count all the rules used in each tree. Run your code on train.trees.pre.unk. How many unique rules are there? What are the top five most frequent rules, and how many times did each occur?3
2. Write code to compute the conditional probability of each rule and print the grammar out in the above format. What are the top five highest-probability rules, and what are their probabilities?2
Now write a CKY parser that takes your grammar and a sentence as input, and outputs the highest-probability parse. If you can't find any parse, output a blank line. Don't forget to replace unknown words with <unk>. Don't forget to use log-probabilities to avoid underflow.
1. Run your parser on dev.strings and save the output to dev.parses. Show the output of your parser on the first five lines of dev.strings, along with their log-probabilities (base 10).5
2. Show a plot of parsing time ($y$ axis) versus sentence length ($x$ axis). Use a log-log scale. Estimate the value of $k$ for which $y \approx cx^k$ (you can do a least-squares fit or just eyeball it). Is it close to 3, and why or why not?1
3. Run your parser output through postprocess.py and save the output to dev.parses.post. Evaluate your parser output against the correct trees in dev.trees using the command:
```
python evalb.py dev.parses.post dev.trees
```
  Show the output of this script, including your F1 score, which should be at least 88%.5
Make at least three modifications to try to improve the accuracy of your parser. You can try the techniques described in class, or any other techniques you can think of or find in the literature. Remember that if you modify preprocess.py, you should (probably) also modify postprocess.py accordingly.
1. For each of your three modifications, describe what you did.3×1 Run your modified parser on dev.strings and report your new F1 score.3×1 Do not run your parser on test.strings yet. What helped, or what didn't? Why?3×1
2. Finally, run your best parser on test.strings. What F1 score did you get? Your score should be at least 90%.5
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
- A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
- All of the code that you wrote.
- A README file with instructions on how to build and run your code on student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.

`train.trees`	Training data
`dev.strings`	Development data
`dev.trees`	Development data
`test.strings`	Test data: don't look at these files!
`test.trees`	Test data: don't look at these files!
`preprocess.py`	Preprocessor
`unknown.py`	Replace one-count words with `<unk>`
`postprocess.py`	Postprocessor
`evalb.py`	Compute labeled precision/recall