In this assignment you will build and improve a simple parser trained on the ATIS portion of the Penn Treebank. The ATIS (Air Travel Information System) portion consists of short queries and commands spoken by users of a fake robot travel agent.
Clone (don't fork) the HW3 repository. It contains the following files:
train.trees | Training data |
dev.strings | Development data (strings) |
dev.trees | Development data (trees) |
test.strings | Test data (strings): don't peek! |
test.trees | Test data (trees): don't peek! |
preprocess.py | Preprocessor |
unknown.py | Replace one-count words with <unk> |
postprocess.py | Postprocessor |
evalb.py | Compute labeled precision/recall |
Run train.trees through preprocess.py and save the output to train.trees.pre. This script makes the trees strictly binary branching. When it binarizes, it inserts nodes with labels of the form X*, and when it removes unary nodes, it fuses labels so they look like X_Y.
Run train.trees.pre through postprocess.py and verify that the output is identical to the original train.trees. This script reverses all the modifications made by preprocess.py.
Run train.trees.pre through unknown.py and save the output to train.trees.pre.unk. This script replaces all words that occurred only once with the special symbol <unk>.
You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting. You are free to use any of the Python code provided when you program your own solutions to this assignment. (In particular, the module tree.py has useful code for handling trees.)
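To make the unknown-word step concrete, here is a minimal sketch of the idea behind replacing one-count words. It is not the provided unknown.py: it assumes the corpus is given as lists of tokens rather than trees, and the function name `replace_singletons` is hypothetical.

```python
from collections import Counter

def replace_singletons(sentences, unk="<unk>"):
    """Replace every word that occurs exactly once in the corpus with unk."""
    # Count word occurrences over the whole corpus, not per sentence.
    counts = Counter(word for sent in sentences for word in sent)
    return [[w if counts[w] > 1 else unk for w in sent] for sent in sentences]

corpus = [["show", "me", "flights"], ["show", "me", "fares"]]
print(replace_singletons(corpus))
```

Because the singleton words disappear from the training data, the grammar learned from it contains rules that rewrite to <unk>, which can later stand in for words never seen in training.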
NP -> DT NN # 0.5
NP -> DT NNS # 0.5
DT -> the # 1.0
NN -> boy # 0.5
NN -> girl # 0.5
NNS -> boys # 0.5
NNS -> girls # 0.5
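A grammar like the one above can be obtained by counting rules and dividing each count by the total count of its left-hand side. The following is a rough sketch under stated assumptions: trees are represented as nested tuples `(label, child, ...)` with words as plain strings (this is not the representation used by tree.py), and the names `count_rules` and `estimate` are hypothetical.

```python
from collections import Counter

def count_rules(tree, counts):
    """Add the rules used in one tree to counts.
    A tree is (label, child, ...); a leaf child is a plain string."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label, rhs] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def estimate(counts):
    """Relative-frequency estimate: count(X -> rhs) / count(X)."""
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

trees = [
    ("NP", ("DT", "the"), ("NN", "boy")),
    ("NP", ("DT", "the"), ("NNS", "girls")),
    ("NP", ("DT", "the"), ("NN", "girl")),
    ("NP", ("DT", "the"), ("NNS", "boys")),
]
counts = Counter()
for t in trees:
    count_rules(t, counts)
probs = estimate(counts)
for (lhs, rhs), p in probs.items():
    print(f"{lhs} -> {' '.join(rhs)} # {p}")
```

With the counts held in a `Counter`, `counts.most_common(5)` gives the five most frequent rules together with their counts.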
Collect the grammar rules from train.trees.pre.unk. How many unique rules are there? What are the top five most frequent rules, and how many times did each occur? (3 points)
In your parser, replace unknown words with <unk>. Don't forget to use log-probabilities to avoid underflow.
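The two points above (unknown words and log-probabilities) can be illustrated with a small Viterbi CKY sketch. It is one possible design, not a reference solution: it assumes the grammar is a dict mapping `(lhs, rhs_tuple)` to a probability, that every rule is either binary (X -> Y Z) or lexical (X -> w), and that base-10 logs are wanted. The function name `cky` and its parameters are hypothetical.

```python
import math

def cky(words, probs, start="S", unk="<unk>"):
    """Return (log10 probability, tree) of the best parse, or None if there is none."""
    # Words that never appear in a lexical rule are mapped to the unknown symbol.
    vocab = {rhs[0] for (_, rhs) in probs if len(rhs) == 1}
    words = [w if w in vocab else unk for w in words]
    binary = [(lhs, rhs[0], rhs[1], math.log10(p))
              for (lhs, rhs), p in probs.items() if len(rhs) == 2]
    n = len(words)
    # best[i][j] maps a label to (log-probability, tree) for the span words[i:j].
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (lhs, rhs), p in probs.items():
            if rhs == (w,):
                best[i][i + 1][lhs] = (math.log10(p), (lhs, w))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[i][j]
            for k in range(i + 1, j):
                left, right = best[i][k], best[k][j]
                for lhs, y, z, logp in binary:
                    if y in left and z in right:
                        # Adding logs replaces multiplying probabilities.
                        score = logp + left[y][0] + right[z][0]
                        if lhs not in cell or score > cell[lhs][0]:
                            cell[lhs] = (score, (lhs, left[y][1], right[z][1]))
    return best[0][n].get(start)

probs = {
    ("NP", ("DT", "NN")): 0.5, ("NP", ("DT", "NNS")): 0.5,
    ("DT", ("the",)): 1.0,
    ("NN", ("boy",)): 0.5, ("NN", ("girl",)): 0.5,
    ("NNS", ("boys",)): 0.5, ("NNS", ("girls",)): 0.5,
}
logp, tree = cky(["the", "boy"], probs, start="NP")
print(tree, logp)
```

Looping over every binary rule in every cell is simple but slow; indexing rules by their right-hand side is a common speed-up. Note that this toy grammar has no rule rewriting to <unk>, so a sentence containing an unseen word returns None here.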
Run your parser on dev.strings and save the output to dev.parses. Show the output of your parser on the first five lines of dev.strings, along with their log-probabilities (base 10). (5 points)
Run dev.parses through postprocess.py and save the output to dev.parses.post.
Evaluate your parser output against the correct trees in dev.trees using the command:
    python evalb.py dev.parses.post dev.trees
Show the output of this script, including your F1 score, which should be at least 88%. (5 points)
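For intuition about what the F1 score measures, here is a small sketch of labeled precision, recall, and F1 over bracketed spans. It is not the provided evalb.py and may differ from it in details: it assumes nested-tuple trees `(label, child, ...)` with words as strings, and it counts every node, including part-of-speech tags. The names `brackets` and `f1` are hypothetical.

```python
from collections import Counter

def brackets(tree, start=0):
    """Return (multiset of (label, i, j) spans, end position) for one tree."""
    label, *children = tree
    spans = Counter()
    pos = start
    for c in children:
        if isinstance(c, str):
            pos += 1          # a word advances the position by one
        else:
            sub, pos = brackets(c, pos)
            spans += sub
    spans[label, start, pos] += 1
    return spans, pos

def f1(guess_trees, gold_trees):
    """Labeled F1 over all sentences, pooling bracket counts across the corpus."""
    match = n_guess = n_gold = 0
    for g, t in zip(guess_trees, gold_trees):
        gb, _ = brackets(g)
        tb, _ = brackets(t)
        match += sum((gb & tb).values())   # brackets present in both trees
        n_guess += sum(gb.values())
        n_gold += sum(tb.values())
    if match == 0:
        return 0.0
    precision = match / n_guess
    recall = match / n_gold
    return 2 * precision * recall / (precision + recall)

gold = ("S", ("NP", ("DT", "the"), ("NN", "boy")), ("VP", ("VBZ", "sleeps")))
guess = ("S", ("NP", ("DT", "the"), ("NN", "boy")), ("VP", ("NN", "sleeps")))
print(f1([guess], [gold]))
```

In this example five of the six brackets in each tree match, so precision and recall are both 5/6.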
If you modify preprocess.py, you should (probably) also modify postprocess.py accordingly.
Run your parser on dev.strings and report your new F1 score. (3×1 points) Do not run your parser on test.strings yet.
What helped, or what didn't? Why? (3×1 points)
Finally, run your parser on test.strings. What F1 score did you get? Your score should be at least 90%. (5 points)
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
Your code, which should build and run on student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.