In 2005, a blog post went viral showing a bootleg copy of Star Wars: Episode III – Revenge of the Sith whose Chinese version had been translated (apparently by low-quality machine translation) back into English, turning it into a movie called Star War: The Third Gathers – The Backstroke of the West. Can you do better?
Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. Initially, it contains the following files:
| `data/train.*` | training data |
| `data/dev.*` | development data |
| `data/test.*` | test data |
| `bleu.py` | evaluation script for translation |
| `layers.py` | some useful neural network layers |
| `utils.py` | some other useful functions |
| `model2.py` | direct IBM Model 2 |
If you're using Kaggle, you can also link to the Star Wars Chinese-English dataset there.
The training data consists of a number of Star Wars scripts (Episodes 1–2, 4–7, Rogue One, Solo, Rebels S1E1–3). The dev and test data are Episode 3. For each dataset, there are two or three files: `.zh` means Chinese, `.reference.en` means (correct) English, and `.backstroke.en` means the English of Backstroke of the West.
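If you want to eyeball a few parallel sentences before starting, something like the following works (a small sketch; it assumes you run it from the repository root and that the dev files are named as described above):

```python
# Print a few dev sentences side by side (Chinese source, then the reference English).
# Assumes the repository layout listed above.
with open("data/dev.zh", encoding="utf-8") as zh_file, \
     open("data/dev.reference.en", encoding="utf-8") as en_file:
    for _, (zh, en) in zip(range(3), zip(zh_file, en_file)):
        print(zh.strip())
        print(en.strip())
        print()
```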
As distributed, `model2.py` implements the following model, which is a variant of direct IBM Model 2 (Section 3.3.3) with the $t$ and $a$ tables factored into smaller matrices:
\begin{align}
P(\mathbf{e} \mid \mathbf{f}) &= \prod_{i=1}^m \sum_{j=1}^n a(j \mid i) \, t(e_i \mid f_j) \\
t(e_i \mid f_j) &= \left[\operatorname{softmax} \mathbf{U} \mathbf{V}_{f_j} \right]_{e_i} \\
a(j \mid i) &= \left[\operatorname{softmax} \mathbf{K} \mathbf{Q}_{i} \right]_j
\end{align}
where $\mathbf{U}$ and $\mathbf{V}$ are matrices of learnable parameters whose rows can be thought of as embeddings of the English and Chinese vocabularies, respectively, and $\mathbf{Q}$ and $\mathbf{K}$ are matrices of learnable parameters whose rows can be thought of as embeddings of the numbers 1 to 100.
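Spelled out in code, the two factored tables are just softmaxes of low-rank products. The sketch below is not the implementation in `model2.py`; it only restates the equations above, and all sizes and the helper name `log_prob` are made up for illustration:

```python
import torch

# Hypothetical sizes, purely for illustration.
n_english, n_chinese, max_len, d = 8000, 8000, 100, 64

U = torch.nn.Parameter(torch.randn(n_english, d))   # rows: English word embeddings
V = torch.nn.Parameter(torch.randn(n_chinese, d))   # rows: Chinese word embeddings
Q = torch.nn.Parameter(torch.randn(max_len, d))     # rows: embeddings of English positions 1..100
K = torch.nn.Parameter(torch.randn(max_len, d))     # rows: embeddings of Chinese positions 1..100

def log_prob(e, f):
    """log P(e | f) for 1-D tensors of token ids e (length m) and f (length n)."""
    m, n = len(e), len(f)
    # t(. | f_j): one distribution over the English vocabulary per Chinese word.
    t = torch.softmax(V[f] @ U.t(), dim=-1)          # (n, n_english)
    # a(. | i): one distribution over Chinese positions per English position.
    a = torch.softmax(Q[:m] @ K[:n].t(), dim=-1)     # (m, n)
    # P(e_i | f) = sum_j a(j | i) t(e_i | f_j), taken at the observed e_i.
    p = (a @ t)[torch.arange(m), e]                  # (m,)
    return p.log().sum()
```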
In this assignment, you'll improve this model so that it translates better than Backstroke of the West. (Because the training data is so small, this is a challenging task, and your model probably won't be that much better.)
You may reuse any code you've used in HW1, or even the solution or another student's code, as long as you cite properly. You may use any PyTorch function except anything that has `Transformer` in its name.
First, familiarize yourself with the code in `model2.py`. Then run `python model2.py`, which trains direct IBM Model 2 and then translates the test set into the file `test.model2.en`. It also prints out the BLEU score.
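The score you report should come from the provided `bleu.py`. If you want a feel for what that number measures, here is a minimal, self-contained corpus-level BLEU-4 sketch (uniform n-gram weights, brevity penalty, no smoothing); the helper names `ngrams` and `bleu` are made up, and the provided script may differ in details such as tokenization and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU. Both arguments are parallel lists of token lists."""
    match = [0] * max_n      # clipped n-gram matches, per order
    total = [0] * max_n      # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(match) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)   # brevity penalty, in log space
    return math.exp(log_precision + log_brevity)

# Example: bleu([line.split() for line in open("test.model2.en")],
#               [line.split() for line in open("data/test.reference.en")])
```

The function returns a value between 0 and 1; the targets later in this assignment are quoted as percentages.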
Next, modify `Decoder.step()` and `Decoder.forward()` so that the weighted average is inside the softmax, as in equation (3.26) and Figure 3.1c.
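In other words, instead of taking a weighted average of the $n$ output distributions (one softmax per Chinese word), you first take the weighted average of the Chinese word embeddings and apply a single softmax to the result. Below is a hedged sketch of the two variants, reusing the $\mathbf{U}$, $\mathbf{V}$, $\mathbf{Q}$, $\mathbf{K}$ notation from above; this is not the actual `Decoder` code, and you should follow equation (3.26) for the exact form:

```python
import torch

def outside_softmax(U, V, Q, K, f, m):
    """As distributed: a weighted average of n softmaxes, one per Chinese word."""
    n = len(f)
    a = torch.softmax(Q[:m] @ K[:n].t(), dim=-1)     # (m, n) attention weights
    t = torch.softmax(V[f] @ U.t(), dim=-1)          # (n, n_english)
    return a @ t                                     # (m, n_english)

def inside_softmax(U, V, Q, K, f, m):
    """Modified: average the Chinese word embeddings first, then one softmax per position."""
    n = len(f)
    a = torch.softmax(Q[:m] @ K[:n].t(), dim=-1)     # (m, n) attention weights
    c = a @ V[f]                                     # (m, d) context vectors
    return torch.softmax(c @ U.t(), dim=-1)          # (m, n_english)
```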
The Chinese word for droid is 机 器 人 (three tokens). Did you see improvement in the translation of this word, and why?
After that, add an `Encoder` class. You may use any of the classes/functions provided in `layers.py`, many of which have tricks tailored for small training data.
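To give a sense of the overall shape, here is a plain-PyTorch sketch of an encoder with one self-attention layer and one feed-forward layer, each wrapped in a residual connection. It deliberately avoids guessing the interfaces in `layers.py` (which you should use instead, since its classes have the small-data tricks mentioned above); every size and name here is made up:

```python
import torch

class Encoder(torch.nn.Module):
    """Illustrative only: embed the source words, then one self-attention layer
    and one feed-forward layer, each with a residual connection."""
    def __init__(self, vocab_size, d=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)
        self.pos = torch.nn.Embedding(100, d)            # position embeddings 1..100, as above
        self.wq = torch.nn.Linear(d, d, bias=False)
        self.wk = torch.nn.Linear(d, d, bias=False)
        self.wv = torch.nn.Linear(d, d, bias=False)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d, d), torch.nn.ReLU(), torch.nn.Linear(d, d))

    def forward(self, f):
        """f: 1-D tensor of Chinese token ids. Returns one vector per source word, shape (n, d)."""
        n = len(f)
        x = self.emb(f) + self.pos(torch.arange(n))
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        att = torch.softmax(q @ k.t() / x.size(-1) ** 0.5, dim=-1)
        x = x + att @ v                                   # residual around self-attention
        x = x + self.ffn(x)                               # residual around the FFN
        return x
```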
Report your BLEU score on `test`. To get full credit, your score must be better than 1.5%. Write down some observations about what did or didn't improve.
Finally, add self-attention to the decoder using `layers.MaskedSelfAttention`, which ensures that each English word attends to itself and to the left. Again, the FFNs should have residual connections (eq. 3.64).
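The "attends to itself and to the left" behavior is just a causal mask on the attention scores. The provided `layers.MaskedSelfAttention` already handles this; the sketch below only illustrates the masking idea and is not that class:

```python
import torch

def masked_self_attention(q, k, v):
    """q, k, v: (m, d). Position i may attend only to positions j <= i (itself and to the left)."""
    m, d = q.shape
    scores = q @ k.t() / d ** 0.5                          # (m, m) attention scores
    future = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # block attention to the right
    return torch.softmax(scores, dim=-1) @ v               # (m, d)
```

The point of the mask is that `forward()`, which sees the whole English sentence at training time, behaves consistently with `step()`, which during decoding only sees the words generated so far.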
Most of your modifications will be in the `Decoder` class; don't forget to edit all three of `Decoder.start()`, `Decoder.step()`, and `Decoder.forward()`. Report your BLEU score on `test`. To get full credit, your score must be better than Backstroke of the West (4.6%). Write down some observations about what did or didn't improve.

Please read these submission instructions carefully.