Whenever the instructions below say to "report" something, it should be reported in the README.md file that you submit.
1. Setup
- Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. Initially, it contains the following files:
| file | description |
|---|---|
| `data/train` | training data |
| `data/dev` | development/validation data |
| `data/dev.words` | development/validation data (words only) |
| `data/test` | test data |
| `data/test.words` | test data (words only) |
| `layers.py` | possibly useful neural network layers |
| `seqlabel.py` | computes the F1 score of sequence labels |
The data files have one sentence per line. In `train`, `dev`, and `test`, each space-separated token is of the form `word:label`.
- Write code to read in the training data. Report how many unique labels are in the data (not including any special labels that you add for BOS or EOS). (2 points)
- Write code to replace all words seen only once in the training data with `UNK`. (1 point)
2. RNN
In this part, you’ll implement a sequence labeler that is just an RNN. Since there is no CRF layer on top, it predicts labels independently of one another.
If we have a single sentence $w=w_1 \cdots w_n$ with correct labels $\mathcal{X} = X_1 \cdots X_n$, we compute:
\begin{align}
\mathbf{v}^{(t)} &= \text{Embedding}^{\fbox{1}}(w_t) & t &= 1, \ldots, n \\
\mathbf{G} &= \text{RNN}^{\fbox{2}}([\mathbf{v}^{(1)} \cdots \mathbf{v}^{(n)}]^\top) \\
\mathbf{H} &= \text{RNN}^{\fbox{3}}(\mathbf{G}) \\
\mathbf{y}^{(t)} &= \text{SoftmaxLayer}^{\fbox{4}}(\mathbf{H}_{t}) & t &= 1, \ldots, n
\end{align}
Then, for this single sentence, the loss function to be minimized is $-\sum_{t=1}^{n} \left[\log \mathbf{y}^{(t)}\right]_{X_t}$.
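A NumPy sketch of this forward computation and loss follows (illustrative only: the dimensions and parameter shapes are assumptions, and your actual implementation may use `layers.py` or a deep learning framework with automatic differentiation):

```python
import numpy as np

def rnn_layer(X, Wx, Wh, b):
    """A simple tanh RNN over a sequence X: (n, d_in) -> (n, d_out)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h + b)
        out.append(h)
    return np.stack(out)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(word_ids, E, params1, params2, Wo, bo):
    V = E[word_ids]                 # v^(t) = Embedding(w_t)
    G = rnn_layer(V, *params1)      # G = RNN(V), first layer
    H = rnn_layer(G, *params2)      # H = RNN(G), second layer
    Y = softmax(H @ Wo.T + bo)      # y^(t) = SoftmaxLayer(H_t)
    return Y

def loss(Y, labels):
    # -sum_t log [y^(t)]_{X_t}
    n = len(labels)
    return -np.log(Y[np.arange(n), labels]).sum()
```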
- Implement an RNN encoder only (no CRF). (5 points) For each word, use a softmax layer (a linear transformation followed by a softmax) to predict a label for that word. (1 point) The following configuration worked well for us: a 2-layer RNN with 200 dimensions per layer, trained with the Adam optimizer at a learning rate of 0.001.
- Implement a labeler that just guesses, for each word, the label with the highest probability for that word. (2 points)
- Train this model on the training data. After each epoch, label the dev data, report the F1 score (using the function `seqlabel.compute_f1`), and save your model. (1 point) Label the test data and report your test F1. For full credit, your test F1 should be at least 82%. (2 points)
- Examine the dev outputs and write down any observations you have (e.g., common types of failure cases, or conjectures about why some sentences are easier to label than others). (2 points)
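The greedy labeler and the output formatting can be sketched as below (the label inventory and probability matrix are illustrative):

```python
import numpy as np

def greedy_labels(Y, label_list):
    """For each word, pick the label with the highest probability.
    Y: (n, num_labels) array of per-word label distributions."""
    return [label_list[i] for i in Y.argmax(axis=1)]

def format_sentence(words, labels):
    """Write a labeled sentence back out in the word:label format."""
    return ' '.join(f'{w}:{l}' for w, l in zip(words, labels))
```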
3. RNN+CRF
In this part, you’ll add a CRF to your model, which enables it to model dependencies between the labels.
- Add a CRF after your RNN encoder, as described in the notes (removing the softmax layer). (6 points) We recommend implementing all the speedups described in the notes, especially those under "Vectorization."
- Implement a labeler that predicts the highest-scoring label sequence. (3 points)
- Train this model on the training data. Budget time for training: depending on your implementation, it could take about 15 minutes per epoch, and you will probably need to train for 5 or more epochs. After each epoch, label the dev data, report the F1 score, and save your model. (1 point) Label the test data and report your test F1. For full credit, your test F1 should be at least 84%. (2 points)
- Examine the dev outputs and write down any observations you have (e.g., how did adding the CRF help? Have the common types of failure cases changed from Part 2?). (2 points)
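One standard way to find the highest-scoring label sequence is the Viterbi algorithm, vectorized over the previous label. The sketch below assumes per-position scores from your encoder and a learned transition matrix (the names are illustrative, and BOS/EOS transition handling is omitted):

```python
import numpy as np

def viterbi(scores, trans):
    """Return the argmax label index sequence under an RNN+CRF model.
    scores: (n, L) per-position label scores from the encoder
    trans:  (L, L) transition scores, trans[i, j] = score of label i -> j"""
    n, L = scores.shape
    best = scores[0].copy()            # best score of any path ending in each label
    back = np.zeros((n, L), dtype=int) # backpointers; back[0] is unused
    for t in range(1, n):
        # Vectorized over previous labels: cand[i, j] = best[i] + trans[i, j]
        cand = best[:, None] + trans
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + scores[t]
    # Follow the backpointers from the best final label
    path = [int(best.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The forward (log-partition) computation for training has the same structure, with `max`/`argmax` replaced by a log-sum-exp over the previous label.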
Submission
Please read these submission instructions carefully.
- Add and commit your submission files to the repository you created at the beginning. The repository should contain:
- All of the code that you wrote.
- Your final model and outputs from Parts 2 and 3.
- A README.md file with:
- instructions on how to build/run your code, and
- your responses to all of the instructions/questions in the assignment.
- To submit:
- Push your work to GitHub and create a release in GitHub by clicking on "Releases" on the right-hand side, then "Create a new release" or "Draft a new release". Fill in "Tag version" and "Release title" with the part number(s) you’re submitting and click "Publish Release".
- If you submit the same part more than once, the grader will grade the latest release for that part.
- For computing the late penalty, the submission time will be considered the commit time, not the release time.