CSE 40657/60657: Natural Language Processing

Whenever the instructions below say to "report" something, it should be reported in the README.md file that you submit.

1. Setup

Visit this GitHub Classroom link to create a Git repository for yourself, then clone it to your computer. Initially, it contains the following files:

file	description
train	training data
dev	development/validation data
test	test data
layers.py	possibly useful neural network layers
seqlabel.py	compute F1 score of sequence labels

The data files have one sentence per line. In train, dev, and test, each space-separated token is of the form word:label.
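As a minimal sketch of the word:label format described above, the function below splits one line into parallel lists of words and labels. The function name `parse_line` and the example sentence and labels are made up for illustration. Splitting on the last colon is an assumption, made so that a word containing a colon of its own would still parse.

```python
def parse_line(line):
    """Split one data line into parallel lists of words and labels."""
    words, labels = [], []
    for token in line.split():
        # Assumption: the label never contains ':', but the word might,
        # so split on the LAST colon only.
        word, label = token.rsplit(":", 1)
        words.append(word)
        labels.append(label)
    return words, labels


# Hypothetical example line; the real data may use different labels.
example = "show:O flights:O to:O boston:B-city"
words, labels = parse_line(example)
print(words)
print(labels)
```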

For all parts of this assignment, you are free to reuse your code or someone else's from HW3.
Write code to read in the training data. Please report how many unique labels are in the data (not including any special labels that you add for BOS or EOS). (2 points)
Write code to replace all words seen only once in the training data with UNK. (1 point)
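The two steps above can be sketched as follows, assuming the training data has already been read into a list of (words, labels) pairs. The names `unique_labels` and `replace_singletons` and the toy sentences are hypothetical, not part of the provided files.

```python
from collections import Counter


def unique_labels(sentences):
    """Return the set of labels that occur in a list of (words, labels) pairs."""
    return {label for _, labels in sentences for label in labels}


def replace_singletons(sentences, unk="UNK"):
    """Replace every word that occurs exactly once in `sentences` with `unk`."""
    counts = Counter(word for words, _ in sentences for word in words)
    return [
        ([word if counts[word] > 1 else unk for word in words], labels)
        for words, labels in sentences
    ]


# Toy data, invented for illustration only.
train = [
    (["flights", "to", "boston"], ["O", "O", "B-city"]),
    (["flights", "to", "denver"], ["O", "O", "B-city"]),
]
print(len(unique_labels(train)))
print(replace_singletons(train))
```

Counting over the whole training set before replacing anything keeps the replacement independent of sentence order.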

2. Transformer

In this part, you’ll implement a sequence labeler that is just a transformer. Since there is no CRF layer on top, it predicts labels independently of one another. If we have a single sentence $w=w_1 \cdots w_n$ with correct labels $t_1 \cdots t_n$, we compute a sequence of encodings $\mathbf{H}_1, \ldots, \mathbf{H}_n$ just as in HW3, then: \begin{align} \mathbf{y}^{(i)} &= \text{SoftmaxLayer}(\mathbf{H}_{i}) & i &= 1, \ldots, n \end{align} where $\text{SoftmaxLayer}$ is a linear layer followed by a log-softmax, as in layers.py. Then, for this single sentence, the loss function to be minimized is $-\sum_{i=1}^{n} \left[ \mathbf{y}^{(i)}\right]_{t_i}$, the negative log-probability of the correct labels.
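To make the loss concrete, here is a small pure-Python sketch. It assumes the linear layer's outputs are already available as a list of per-position score vectors (called `logits` here, a hypothetical name), applies a log-softmax to each, and sums the negated log-probabilities of the correct labels. A real implementation would operate on tensors and use the layers in layers.py instead.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax of a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def sentence_loss(logits, gold):
    """Negative log-likelihood of one sentence.

    logits[i][k] is the unnormalized score of label k at position i
    (0-based here); gold[i] is the index of the correct label t_i.
    """
    return -sum(log_softmax(row)[t] for row, t in zip(logits, gold))


# Two positions, two labels, all scores equal: each position
# contributes log 2 to the loss.
print(sentence_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
```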

Implement the above model. (6 points) For us, the following configuration worked well: a 4-layer 4-head (as in HW3's parser.py) transformer with d=64, using the Adam optimizer with a learning rate of 0.0003. It should take about 2 minutes per epoch.
Implement a labeler that just guesses, for each word, the label with the highest probability for that word. (2 points)
Train this model on the training data. After each epoch, label the dev data, report the F1 score (using the function seqlabel.compute_f1), and save your model. (1 point) A good dev F1 is 52–53%. Label the test data and report your test F1. For full credit, your test F1 should be at least 45%. (2 points)
Examine the dev outputs and write down any observations you have (e.g., common types of failure cases, conjectures on why some sentences are easier to label than others). (2 points)
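The per-word labeler in this part amounts to an independent argmax at each position. A minimal sketch, assuming the model's log-probabilities for one sentence are given as a list of lists and that `label_names` (a hypothetical name, with made-up labels below) maps label indices back to strings:

```python
def greedy_labels(log_probs, label_names):
    """Pick the highest-scoring label at each position independently."""
    predicted = []
    for row in log_probs:
        best = max(range(len(row)), key=row.__getitem__)
        predicted.append(label_names[best])
    return predicted


# Made-up log-probabilities for a two-word sentence.
log_probs = [[-0.1, -2.3], [-1.6, -0.2]]
print(greedy_labels(log_probs, ["O", "B-city"]))
```

Because each position is decided on its own, nothing prevents the output from containing label sequences that never occur in the training data, which may be worth looking for when examining the dev outputs.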

3. Transformer+CRF

In this part, you’ll add a CRF to your model, which enables it to model dependencies between the labels.
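To illustrate what modeling dependencies between labels means, here is a sketch of a generic linear-chain CRF loss in pure Python. It is an assumption-laden simplification and may differ from the formulation in the notes: a label sequence is scored as the sum of per-position emission scores `emit[i][k]` plus transition scores `trans[j][k]` between adjacent labels, with no BOS/EOS handling. All function names are hypothetical. The log-partition function is computed with the forward algorithm and checked against brute-force enumeration.

```python
import math
from itertools import product


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(emit, trans, labels):
    """Score of one label sequence: emissions plus adjacent-label transitions."""
    score = emit[0][labels[0]]
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + emit[i][labels[i]]
    return score


def log_partition(emit, trans):
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    num_labels = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[j] + trans[j][k] for j in range(num_labels)]) + emit[i][k]
            for k in range(num_labels)
        ]
    return logsumexp(alpha)


def crf_loss(emit, trans, gold):
    """Negative log-probability of the gold label sequence."""
    return log_partition(emit, trans) - sequence_score(emit, trans, gold)


def brute_force_log_partition(emit, trans):
    """Exponential-time check that enumerates every label sequence."""
    num_labels = len(trans)
    return logsumexp(
        [sequence_score(emit, trans, seq)
         for seq in product(range(num_labels), repeat=len(emit))]
    )


# Made-up scores: 3 positions, 2 labels.
emit = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
trans = [[0.5, -1.0], [-0.5, 1.0]]
print(abs(log_partition(emit, trans) - brute_force_log_partition(emit, trans)) < 1e-9)
print(crf_loss(emit, trans, [0, 1, 1]))
```

The brute-force check is only practical for tiny inputs, but it is a cheap way to catch indexing mistakes before vectorizing.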

Replace the $\text{SoftmaxLayer}$ with a CRF, as described in the notes. (6 points) We recommend implementing all the speedups described in the notes, especially under "Vectorization."
Implement a labeler that predicts the highest-scoring label sequence. (3 points)
Train this model on the training data. Our implementation takes about minutes per epoch. After each epoch, label the dev data, report the F1 score, and save your model. (1 point) A good dev F1 is 64–65%. Label the test data and report your test F1. For full credit, your test F1 should be at least 62%. (2 points)
Examine the dev outputs and write down any observations you have (e.g., How did adding the CRF help? Have the common types of failure cases changed from Part 2?). (2 points)
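Finding the highest-scoring label sequence is typically done with the Viterbi algorithm. The sketch below is unvectorized pure Python under the same simplifying assumptions as a generic linear-chain CRF (emission scores `emit[i][k]`, transition scores `trans[j][k]`, no BOS/EOS handling), so it may not match the notes exactly. The name `viterbi` and the toy scores are illustrative.

```python
def viterbi(emit, trans):
    """Return (best label sequence, its score) for a linear-chain model."""
    num_labels = len(trans)
    best = list(emit[0])   # best[k]: best score of any prefix ending in label k
    backpointers = []      # backpointers[i-1][k]: best previous label for k at position i
    for i in range(1, len(emit)):
        new_best, pointers = [], []
        for k in range(num_labels):
            j = max(range(num_labels), key=lambda j: best[j] + trans[j][k])
            pointers.append(j)
            new_best.append(best[j] + trans[j][k] + emit[i][k])
        best = new_best
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    k = max(range(num_labels), key=best.__getitem__)
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, max(best)


# Made-up scores: taken independently, position 0 prefers label 0 and
# position 1 prefers label 1, but switching labels is heavily penalized.
emit = [[1.0, 0.0], [0.0, 0.5]]
trans = [[0.0, -5.0], [-5.0, 0.0]]
print(viterbi(emit, trans))
```

In this toy example an independent per-position argmax would choose labels 0 then 1, while the transition penalty makes the sequence 0, 0 score highest overall. Differences of this kind are one thing to look for when comparing Part 2 and Part 3 outputs.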

Submission

Please read these submission instructions carefully.

Add and commit your submission files to the repository you created in the beginning. The repository should contain:
- All of the code that you wrote.
- Your final model and outputs from Parts 2 and 3.
- A README.md file with:
  - Instructions on how to build/run your code.
  - Your responses to all of the instructions/questions in the assignment.
To submit:
- Push your work to GitHub and create a release in GitHub by clicking on "Releases" on the right-hand side, then "Create a new release" or "Draft a new release". Fill in "Tag version" and "Release title" with the part number(s) you’re submitting and click "Publish Release".
- If you submit the same part more than once, the grader will grade the latest release for that part.
- For computing the late penalty, the submission time will be considered the commit time, not the release time.

CSE 40657/60657 Homework 4
