CSE 40657/60657
Homework 3

Due
2023/11/08 5:00pm
Points
30

In this assignment you will build and improve a simple parser trained from the ATIS portion of the Penn Treebank. ATIS (Air Traffic Information System) portion consists of short queries and commands spoken by users of a fake robot travel agent.

Setup

Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. Initially, it contains the following files:

data/train.trees Training data
data/dev.strings Development data (strings)
data/dev.trees Development data (trees)
data/test.strings Test data (strings): don't peek!
data/test.trees Test data (trees): don't peek!
preprocess.py Preprocessor
unknown.py Replace one-count words with <unk>
postprocess.py Postprocessor
trees.py Tree library
layers.py Neural network layers
evalb.py Compute labeled precision/recall
parser.py Skeleton of neural parser

You may write code in any language you choose. You are free to use any of the Python code provided when you program your own solutions to this assignment. (In particular, the module trees.py has useful code for handling trees.)

1. Training a PCFG

First, some preliminaries (nothing to report for these steps):

  1. When run as a script, trees.py pretty-prints trees:
    $ head -1 train.trees | python trees.py
                                      TOP
                 ┌─────────────────────┴─────────────────────┐
                 S                                          PUNC
                 │                                           │
                 VP                                          .
      ┌──────────┴──────────┐
      VB                    NP
      │    ┌───────────┬────┴────┬───────────┐
    List   NP          PP        PP         SBAR
        ┌──┴──┐      ┌─┴──┐    ┌─┴─┐      ┌──┴───┐
        DT   NNS     IN   NP   TO  NP    WHNP    S
        │     │      │    │    │   │      │      │
       the flights from  NNP  to  NNP    WDT     VP
                          │        │      │   ┌──┴──┐
                      Baltimore Seattle that VBP    PP
                                              │  ┌──┴──┐
                                            stop IN    NP
                                                 │     │
                                                in    NNP
                                                       │
                                                  Minneapolis
    
  2. Run train.trees through preprocess.py and save the output to train.trees.pre. This script makes the trees strictly binary branching. When it binarizes, it inserts nodes with labels of the form X*, and when it removes unary nodes, it fuses labels so they look like X_Y.
  3. Run train.trees.pre through postprocess.py and verify that the output is identical to the original train.trees. This script reverses all the modifications made by preprocess.py.
  4. Run train.trees.pre through unknown.py and save the output to train.trees.pre.unk. This script replaces all words that occurred only once with the special symbol <unk>.

Next, we will learn a probabilistic CFG from trees, and store it in the following format:

NP -> DT NN # 0.5
NP -> DT NNS # 0.5
DT -> the # 1.0
NN -> boy # 0.5
NN -> girl # 0.5
NNS -> boys # 0.5
NNS -> girls # 0.5

  1. Write code to read in trees and to count all the rules used in each tree. Run your code on train.trees.pre.unk. Please report: How many unique rules are there?1 What are the top five most frequent rules, and how many times did each occur?2
  2. Write code to compute the conditional probability of each rule and print the grammar out in the above format. Please report: What are the top five highest-probability rules with left-hand side NNP, and what are their probabilities?2

2. Parsing

Now write a CKY parser that takes your grammar and a sentence as input, and outputs the highest-probability parse.6 If you can't find any parse, you can either print out a fall-back parse or a blank line. Don't forget to replace unknown words with <unk>. Don't forget to use log-probabilities to avoid underflow.

  1. Run your parser on dev.strings. Run the parses through postprocess.py and save the result to dev.pcfg_parses. Show the output of your parser on the first five lines of dev.strings, along with their log-probabilities (base e).3
  2. Show a plot of parsing time ($y$ axis) versus sentence length ($x$ axis). Use a log-log scale. If the times are too short, you can use the timeit module to get more accurate timings. Estimate the value of $k$ for which $y \approx cx^k$ (you can do a least-squares fit or just eyeball it). Is it close to 3, and why or why not?3
  3. Run your parser and postprocess.py again on test.strings and save the output to test.pcfg_parses. Evaluate your parser output against the correct trees in test.trees using the command:
    python evalb.py test.pcfg_parses test.trees
    
    Show the output of this script, including your F1 score, which should be at least 90%.3

3. Training a neural parser

  1. The program parser.py is a simple neural parser, with one missing function, Model.parse(). See the docstring of this function for what it is required to do. Adapt your code from Part 2 to implement this function.5
  2. Train the model and save it to a file.1 Please show in your report the output of the trainer.1
  3. Parse test.strings, run it through postprocess.py, and save the output to test.neural_parses.post.1 Please report your score,1 which should be about 90%.1 (Unfortunately, at this training data size, the neural network doesn't seem to help much!)

Please read these submission instructions carefully.

  1. Add and commit your submission files to the repository you created in the beginning. The repository should contain:
    • All of the code that you wrote.
    • Your grammar from Part 1, your outputs from Part 2, and your final model and outputs from Part 3.
    • A README.md file with
      • instructions on how to build/run your code.
      • Your responses to all of the instructions/questions in the assignment.
  2. To submit:
    • Push your work to GitHub and create a release in GitHub by clicking on "Releases" on the right-hand side, then "Create a new release" or "Draft a new release". Fill in "Tag version" and "Release title" with the part number(s) you're submitting and click "Publish Release".
    • If you submit the same part more than once, the grader will grade the latest release for that part.
    • For computing the late penalty, the submission time will be considered the commit time, not the release time.