In this assignment you will build and improve a simple parser trained on the ATIS portion of the Penn Treebank. The ATIS (Air Travel Information System) portion consists of short queries and commands spoken by users of a simulated robot travel agent.
Visit this GitHub Classroom link to create your Git repository, and clone it to your computer. Initially, it contains the following files:
| File | Description |
|------|-------------|
| data/train.trees | Training data |
| data/dev.strings | Development data (strings) |
| data/dev.trees | Development data (trees) |
| data/test.strings | Test data (strings): don't peek! |
| data/test.trees | Test data (trees): don't peek! |
| preprocess.py | Preprocessor |
| unknown.py | Replace one-count words with <unk> |
| postprocess.py | Postprocessor |
| trees.py | Tree library |
| layers.py | Neural network layers |
| evalb.py | Compute labeled precision/recall |
| parser.py | Skeleton of neural parser |
You may write code in any language you choose, and you are free to use any of the provided Python code in your own solutions. (In particular, the module trees.py has useful code for handling trees.)
First, some preliminaries (nothing to report for these steps):
trees.py pretty-prints trees:
```
$ head -1 train.trees | python trees.py
                                               TOP
                 ┌──────────────────────────────┴──────────────────────────────┐
                 S                                                            PUNC
                 │                                                             │
                 VP                                                            .
 ┌───────────────┴───────────────┐
 VB                              NP
 │        ┌──────────────┬───────┴──────┬───────────────┐
List      NP             PP             PP             SBAR
       ┌──┴───┐      ┌───┴────┐      ┌──┴───┐      ┌────┴────┐
       DT    NNS     IN       NP     TO     NP   WHNP       S
       │      │      │        │      │      │      │         │
      the  flights  from     NNP     to    NNP    WDT        VP
                              │             │      │     ┌───┴────┐
                          Baltimore      Seattle  that  VBP       PP
                                                         │    ┌───┴────┐
                                                        stop  IN       NP
                                                              │        │
                                                              in      NNP
                                                                       │
                                                                  Minneapolis
```
1. Run train.trees through preprocess.py and save the output to train.trees.pre. This script makes the trees strictly binary branching: when it binarizes, it inserts nodes with labels of the form X*, and when it removes unary nodes, it fuses labels so they look like X_Y. (A sketch of both transformations appears after this list.)
2. Run train.trees.pre through postprocess.py and verify that the output is identical to the original train.trees. This script reverses all the modifications made by preprocess.py.
3. Run train.trees.pre through unknown.py and save the output to train.trees.pre.unk. This script replaces all words that occurred only once with the special symbol <unk>.
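Here is that sketch: a minimal version of binarization and unary collapsing, assuming trees are represented as nested (label, children) tuples with words as plain strings. This is not the provided preprocess.py, whose details may differ:

```python
# Sketch only, not the provided preprocess.py. A tree is a nested
# (label, children) tuple; a word is a plain string.

def binarize(label, children):
    """Make the tree strictly binary branching, inserting X* nodes."""
    children = [binarize(*c) if isinstance(c, tuple) else c for c in children]
    while len(children) > 2:
        # Fold the two rightmost children under a new node labeled X*.
        children = children[:-2] + [(label + '*', children[-2:])]
    return (label, children)

def collapse_unary(label, children):
    """Fuse unary chains X -> Y into a single node labeled X_Y."""
    while len(children) == 1 and isinstance(children[0], tuple):
        child_label, children = children[0]
        label = label + '_' + child_label
    return (label, [collapse_unary(*c) if isinstance(c, tuple) else c
                    for c in children])
```

For example, collapse_unary turns the unary chain S over VP in the tree above into a single S_VP node, and binarize turns the four-way NP branch into nested NP* nodes.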
Next, we will learn a probabilistic CFG from these trees and store it in the following format:
```
NP -> DT NN # 0.5
NP -> DT NNS # 0.5
DT -> the # 1.0
NN -> boy # 0.5
NN -> girl # 0.5
NNS -> boys # 0.5
NNS -> girls # 0.5
```
Write a program that learns a PCFG from the trees in train.trees.pre.unk. Please report: How many unique rules are there? (1 point) What are the top five most frequent rules, and how many times did each occur? (2 points)
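If you work in Python, the estimation is just relative-frequency counting. A minimal sketch, reusing the (label, children) tuple representation from the earlier sketch (how you read the trees in is up to you):

```python
from collections import Counter, defaultdict

def rules(label, children):
    """Yield every rule (lhs, rhs) used in one tree."""
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for c in children:
        if isinstance(c, tuple):
            yield from rules(*c)

def estimate_pcfg(trees):
    """Relative frequency: p(X -> beta) = count(X -> beta) / count(X)."""
    counts = Counter(r for t in trees for r in rules(*t))
    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}
```

Counter's most_common(5) then answers the top-five question directly.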
Now write a CKY parser that takes your grammar and a sentence as input and outputs the highest-probability parse. (6 points) If you can't find any parse, you can either print a fall-back parse or a blank line. Don't forget to replace unknown words with <unk>, and don't forget to use log-probabilities to avoid underflow.
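For reference, here is a minimal sketch of Viterbi CKY in log space for a strictly binarized grammar, using the grammar dictionary from the sketch above; vocab is assumed to be the set of words seen in training. Your own data structures can of course differ:

```python
import math
from collections import defaultdict

def cky(words, grammar, vocab, start='TOP'):
    """Return (logprob, tree) for the best parse, or None if none exists."""
    # Match the training preprocessing: unseen words become <unk>.
    words = [w if w in vocab else '<unk>' for w in words]
    lexical = defaultdict(list)   # word -> [(lhs, logprob)]
    binary = defaultdict(list)    # (Y, Z) -> [(lhs, logprob)]
    for (lhs, rhs), p in grammar.items():
        (lexical[rhs[0]] if len(rhs) == 1 else binary[rhs]).append((lhs, math.log(p)))
    n = len(words)
    best = defaultdict(dict)      # (i, j) -> {label: (logprob, backpointer)}
    for i, w in enumerate(words):
        for lhs, lp in lexical[w]:
            best[i, i + 1][lhs] = (lp, w)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for y, (lp1, _) in best[i, k].items():
                    for z, (lp2, _) in best[k, j].items():
                        for lhs, lp in binary.get((y, z), ()):
                            score = lp + lp1 + lp2
                            if score > best[i, j].get(lhs, (-math.inf,))[0]:
                                best[i, j][lhs] = (score, (k, y, z))
    if start not in best[0, n]:
        return None
    def build(i, j, label):       # follow backpointers to rebuild the tree
        _, back = best[i, j][label]
        if isinstance(back, str):
            return (label, [back])
        k, y, z = back
        return (label, [build(i, k, y), build(k, j, z)])
    return best[0, n][start][0], build(0, n, start)
```

The returned log-probability is exactly the base-e number the next step asks you to report; the tuple tree would still need converting to the assignment's tree format before postprocessing.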
Run your parser on dev.strings, run the parses through postprocess.py, and save the result to dev.pcfg_parses. Show the output of your parser on the first five lines of dev.strings, along with their log-probabilities (base e). (3 points)

Next, time your parser on sentences of various lengths and plot running time $y$ against sentence length $x$. Short sentences parse very quickly, so you may want to use Python's timeit module to get more accurate timings. Estimate the value of $k$ for which $y \approx cx^k$ (you can do a least-squares fit or just eyeball it). Is it close to 3, and why or why not? (3 points)
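For the fit: taking logs of $y \approx cx^k$ gives $\log y \approx \log c + k \log x$, so $k$ is just the slope of a line through the $(\log x, \log y)$ points. A sketch (the timing data here is made up for illustration):

```python
import math

def estimate_exponent(timings):
    """Least-squares slope of log(time) against log(length)."""
    pts = [(math.log(x), math.log(y)) for x, y in timings]
    mx = sum(lx for lx, _ in pts) / len(pts)
    my = sum(ly for _, ly in pts) / len(pts)
    return (sum((lx - mx) * (ly - my) for lx, ly in pts)
            / sum((lx - mx) ** 2 for lx, _ in pts))

# Hypothetical measurements (sentence length, seconds), e.g. from timeit:
print(estimate_exponent([(5, 0.004), (10, 0.031), (20, 0.25), (40, 2.1)]))
```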
Finally, run your parser and postprocess.py again on test.strings and save the output to test.pcfg_parses.
Evaluate your parser output against the correct trees in test.trees
using the command:
```
python evalb.py test.pcfg_parses test.trees
```

Show the output of this script, including your F1 score, which should be at least 90%. (3 points)
The file parser.py is a simple neural parser with one missing function, Model.parse(). See the docstring of this function for what it is required to do. Adapt your code from Part 2 to implement this function. (5 points) Run the parser on test.strings, run the output through postprocess.py, and save the result to test.neural_parses.post. (1 point) Evaluate it against test.trees and report your score (1 point), which should be about 90%. (1 point) (Unfortunately, at this training data size, the neural network doesn't seem to help much!)
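As in Part 2, the score would presumably come from evalb.py, e.g.:

```
python evalb.py test.neural_parses.post test.trees
```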
Please read these submission instructions carefully.