In 2005, a blog post went viral showing a bootleg copy of Star Wars: Episode III – Revenge of the Sith whose Chinese version had been translated (apparently by low-quality machine translation) back into English, turning it into a movie called Star War: The Third Gathers – The Backstroke of the West. Can you do better?
Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. Initially, it contains the following files:
| `data/train.*` | training data |
| `data/dev.*` | development data |
| `data/test.*` | test data |
| `bleu.py` | evaluation script for translation |
| `layers.py` | some useful neural network layers |
| `utils.py` | some other useful functions |
| `model2.py` | direct IBM Model 2 |
If you're using Kaggle, you can also link to the Star Wars Chinese-English dataset there.
The training data consists of a number of Star Wars scripts (Episodes 1–2, 4–7, Rogue One, Solo, Rebels S1E1–3). The dev and test data are Episode 3. For each dataset, there are two or three files: `.zh` means Chinese, `.reference.en` means (correct) English, and `.backstroke.en` means the English of Backstroke of the West.
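If you want to eyeball a few parallel sentences before starting, something like the following works (a small sketch; it assumes you run it from the repository root and that the dev files are named as described above):

```python
# Print a few dev sentences side by side (Chinese source, then the reference English).
# Assumes the repository layout listed above.
with open("data/dev.zh", encoding="utf-8") as zh_file, \
     open("data/dev.reference.en", encoding="utf-8") as en_file:
    for _, (zh, en) in zip(range(3), zip(zh_file, en_file)):
        print(zh.strip())
        print(en.strip())
        print()
```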
As distributed, `model2.py` implements the following model, which is a variant of direct IBM Model 2 (Section 3.3.3) with the $t$ and $a$ tables factored into smaller matrices:
\begin{align}
P(\mathbf{e} \mid \mathbf{f}) &= \prod_{i=1}^m \sum_{j=1}^n a(j \mid i) \, t(e_i \mid f_j) \\
t(e_i \mid f_j) &= \left[\operatorname{softmax} \mathbf{U} \mathbf{V}_{f_j} \right]_{e_i} \\
a(j \mid i) &= \left[\operatorname{softmax} \mathbf{K} \mathbf{Q}_{i} \right]_j
\end{align}
where $\mathbf{U}$ and $\mathbf{V}$ are matrices of learnable parameters whose rows can be thought of as embeddings of the English and Chinese vocabularies, respectively, and $\mathbf{Q}$ and $\mathbf{K}$ are matrices of learnable parameters whose rows can be thought of as embeddings of the numbers 1 to 100.
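Spelled out in code, the two factored tables are just softmaxes of low-rank products. The sketch below is not the implementation in `model2.py`; it only restates the equations above, and all sizes and the helper name `log_prob` are made up for illustration:

```python
import torch

# Hypothetical sizes, purely for illustration.
n_english, n_chinese, max_len, d = 8000, 8000, 100, 64

U = torch.nn.Parameter(torch.randn(n_english, d))   # rows: English word embeddings
V = torch.nn.Parameter(torch.randn(n_chinese, d))   # rows: Chinese word embeddings
Q = torch.nn.Parameter(torch.randn(max_len, d))     # rows: embeddings of English positions 1..100
K = torch.nn.Parameter(torch.randn(max_len, d))     # rows: embeddings of Chinese positions 1..100

def log_prob(e, f):
    """log P(e | f) for 1-D tensors of token ids e (length m) and f (length n)."""
    m, n = len(e), len(f)
    # t(. | f_j): one distribution over the English vocabulary per Chinese word.
    t = torch.softmax(V[f] @ U.t(), dim=-1)          # (n, n_english)
    # a(. | i): one distribution over Chinese positions per English position.
    a = torch.softmax(Q[:m] @ K[:n].t(), dim=-1)     # (m, n)
    # P(e_i | f) = sum_j a(j | i) t(e_i | f_j), taken at the observed e_i.
    p = (a @ t)[torch.arange(m), e]                  # (m,)
    return p.log().sum()
```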
In this assignment, you'll improve this model so that it translates better than Backstroke of the West. (Because the training data is so small, this is a challenging task, and your model probably won't be that much better.)
You may reuse any code you've used in HW1, or even the solution or another student's code, as long as you cite properly. You may use any PyTorch function except anything that has `Transformer` in its name.
First, familiarize yourself with the code in `model2.py`. Then run `python model2.py`, which trains direct IBM Model 2 and then translates the test set into the file `test.model2.en`. It also prints out the BLEU score.
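The score you report should come from the provided `bleu.py`. If you want a feel for what that number measures, here is a minimal, self-contained corpus-level BLEU-4 sketch (uniform n-gram weights, brevity penalty, no smoothing); the helper names `ngrams` and `bleu` are made up, and the provided script may differ in details such as tokenization and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU. Both arguments are parallel lists of token lists."""
    match = [0] * max_n      # clipped n-gram matches, per order
    total = [0] * max_n      # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(match) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)   # brevity penalty, in log space
    return math.exp(log_precision + log_brevity)

# Example: bleu([line.split() for line in open("test.model2.en")],
#               [line.split() for line in open("data/test.reference.en")])
```

The function returns a value between 0 and 1; the targets later in this assignment are quoted as percentages.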
Next, modify `Decoder.step()` and `Decoder.forward()` so that the weighted average is inside the softmax, as in equation (3.26) and Figure 3.1c.
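In other words, instead of taking a weighted average of the $n$ output distributions (one softmax per Chinese word), you first take the weighted average of the Chinese word embeddings and apply a single softmax to the result. Below is a hedged sketch of the two variants, reusing the $\mathbf{U}$, $\mathbf{V}$, $\mathbf{Q}$, $\mathbf{K}$ notation from above; this is not the actual `Decoder` code, and you should follow equation (3.26) for the exact form:

```python
import torch

def outside_softmax(U, V, Q, K, f, m):
    """As distributed: a weighted average of n softmaxes, one per Chinese word."""
    n = len(f)
    a = torch.softmax(Q[:m] @ K[:n].t(), dim=-1)     # (m, n) attention weights
    t = torch.softmax(V[f] @ U.t(), dim=-1)          # (n, n_english)
    return a @ t                                     # (m, n_english)

def inside_softmax(U, V, Q, K, f, m):
    """Modified: average the Chinese word embeddings first, then one softmax per position."""
    n = len(f)
    a = torch.softmax(Q[:m] @ K[:n].t(), dim=-1)     # (m, n) attention weights
    c = a @ V[f]                                     # (m, d) context vectors
    return torch.softmax(c @ U.t(), dim=-1)          # (m, n_english)
```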
The Chinese word for droid is 机 器 人 (three tokens). Did you see improvement in the translation of this word, and why?
After that, add an `Encoder` class. You may use any of the classes/functions provided in `layers.py`, many of which have tricks tailored for small training data.
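To give a sense of the overall shape, here is a plain-PyTorch sketch of an encoder with one self-attention layer and one feed-forward layer, each wrapped in a residual connection. It deliberately avoids guessing the interfaces in `layers.py` (which you should use instead, since its classes have the small-data tricks mentioned above); every size and name here is made up:

```python
import torch

class Encoder(torch.nn.Module):
    """Illustrative only: embed the source words, then one self-attention layer
    and one feed-forward layer, each with a residual connection."""
    def __init__(self, vocab_size, d=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)
        self.pos = torch.nn.Embedding(100, d)            # position embeddings 1..100, as above
        self.wq = torch.nn.Linear(d, d, bias=False)
        self.wk = torch.nn.Linear(d, d, bias=False)
        self.wv = torch.nn.Linear(d, d, bias=False)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d, d), torch.nn.ReLU(), torch.nn.Linear(d, d))

    def forward(self, f):
        """f: 1-D tensor of Chinese token ids. Returns one vector per source word, shape (n, d)."""
        n = len(f)
        x = self.emb(f) + self.pos(torch.arange(n))
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        att = torch.softmax(q @ k.t() / x.size(-1) ** 0.5, dim=-1)
        x = x + att @ v                                   # residual around self-attention
        x = x + self.ffn(x)                               # residual around the FFN
        return x
```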
Report your BLEU score on `test`. To get full credit, your score must be better than 1.5%. Write down some observations about what did or didn't improve.
Finally, add self-attention to the decoder using `layers.MaskedSelfAttention`, which ensures that each English word attends to itself and to the left. Again, the FFNs should have residual connections (eq. 3.64).
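The "attends to itself and to the left" behavior is just a causal mask on the attention scores. The provided `layers.MaskedSelfAttention` already handles this; the sketch below only illustrates the masking idea and is not that class:

```python
import torch

def masked_self_attention(q, k, v):
    """q, k, v: (m, d). Position i may attend only to positions j <= i (itself and to the left)."""
    m, d = q.shape
    scores = q @ k.t() / d ** 0.5                          # (m, m) attention scores
    future = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # block attention to the right
    return torch.softmax(scores, dim=-1) @ v               # (m, d)
```

The point of the mask is that `forward()`, which sees the whole English sentence at training time, behaves consistently with `step()`, which during decoding only sees the words generated so far.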
Most of your modifications will be in the `Decoder` class; don't forget to edit all three of `Decoder.start()`, `Decoder.step()`, and `Decoder.forward()`. Report your BLEU score on `test`. To get full credit, your score must be better than Backstroke of the West (4.6%). Write down some observations about what did or didn't improve.

Please read these submission instructions carefully.