CSE 40657/60657
Homework 5

2018/12/05 5pm

In 2005, a blog post went viral that showed a bootleg copy of Revenge of the Sith with its Chinese version translated (apparently by low-quality machine translation) back into an English movie called Backstroke of the West. Can we do better? In this assignment, you'll build a model to learn word translations from bilingual Star Wars scripts -- the first step to building a better translation system.

This is last year's assignment. For Fall 2019, the assignment may be different.

Clone the Homework 5 repository (note: repository updated 2018/11/27 10am). It contains the following files:

train.zh-entraining data
train.alignalignments for training data
test.zhtest data, input sentences
test.entest data, reference translations
align-f1.pyevaluation script for alignment
bleu.pyevaluation script for translation

The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3. The alignments in train.align are not perfect; they are computed by a higher-quality alignment model (Model 4) on a much larger data set, but you'll treat them as a "silver" standard to evaluate your Model 1 alignments.

You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even the solution or another student's code as long as you cite properly.

1. These are your first steps

  1. Write code to read in train.zh-en.1 It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
  2. Write code to create the data structure(s) to store the model parameters $t(f\mid e)$ and initialize them all to uniform.2 Important: You only need a $t(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $t(f \mid e)$. This will save you a lot of memory and time.

2. Join me, and I will complete your training

  1. Write code to perform the E step.3 As described in the notes, for each sentence pair and for each Chinese position $j=1, \ldots, m$ and English position $i=0, \ldots, \ell$, do: $$c(f_j,e_i) \leftarrow c(f_j,e_i) + \frac{t(f_j \mid e_i)}{\sum_{i'=0}^\ell t(f_j \mid e_{i'})}$$
  2. Write code to perform the M step.3 As described in the notes, for each Chinese word type $f$ and English word type $e$ (including NULL), do: $$t(f \mid e) \leftarrow \frac{c(f,e)}{\sum_{f'} c(f',e)}$$
  3. Train the model on the training data. Report the total (natural) log-likelihood of the data after each pass through the data.1 $$ \begin{align*} \text{log-likelihood} &= \sum_{\text{$(\mathbf{f}, \mathbf{e})$ in data}} \log P(\mathbf{f} \mid \mathbf{e}) \\ P(\mathbf{f} \mid \mathbf{e}) &= \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(\sum_{i=0}^\ell t(f_j \mid e_i)\right). \end{align*}$$ It should increase every time and eventually get better than -175000.3
  4. After training, for each English word $e$ in {Jedi, Force, droid, Sith, lightsabre}, for each of the five Chinese words $f$ with the highest $t(f \mid e)$, report both $f$ and $t(f \mid e)$.1 The top translations should be: 绝地, 原力, 机器人, 西斯, and 光剑.5

3. Test the model. Or test not. There is no try

  1. Write code that, for each sentence pair of the training data, finds the best alignment according to the model, which is, for each $j$: $$\DeclareMathOperator*{\argmax}{arg~max} a_j = \argmax_{0 \leq i \leq \ell} t(f_j \mid e_i).$$ Print the (non-NULL) alignments in the same format as train.align.3
  2. For example:
    0-1 2-0
    means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0. (Note that in this format, positions are zero-based, unlike in the math notation elsewhere in this document.) Report your alignments for the first five lines of train.zh-en.3
  3. Evaluate your alignments against train.align using the command
    python your_alignments train.align
    Report your F1 score. It should be at least 50%.5

4. Now witness the power of this fully operational translation system (optional)

There doesn't seem to be enough time left this semester to require the next part, which is to try to build a translation system that translates Episode 3 better than Backstroke of the West. So this part is entirely optional, worth zero points. Consequently, I've also left it more open-ended than usual.

True Model 1 decoding should search through all possible reorderings of the input sentence. This is somewhat involved, so in this part we'll just try word-for-word translation, with no reordering.

  1. Try a hidden Markov model, like in HW4. Write code that reads in Chinese sentences $\mathbf{f}$ and "labels" each Chinese word with an English word, so as to maximize $\left(\prod_j t(f_j \mid e_j)\right) \times p(\mathbf{e})$, where $t(f \mid e)$ is the translation probability distribution learned in Part 2, and $p(\mathbf{e})$ is an English language model.
  2. Translate the first ten sentences of Episode 3 (test.zh). They unfortunately aren't going to be very good. In your report, show the translations and on what problems you see, and how they might be fixed.
  3. Translate the whole movie and evaluate the accuracy of your translations using the BLEU metric: your_translations test.en
    Report your score, which is between 0 and 1 (higher is better). Did you do better than Backstroke of the West, which gets a BLEU of 0.0315?
  4. Some ideas for improvement:
    • Model deletions, as in HW2, using the probability $t(f \mid \text{NULL})$ as the probability of deleting $f$.
    • The HW5 repository has a module with code for training a Kneser-Ney smoothed language model: lm.kneserney(data, n) creates a n-gram language model (as a FST) from the data in data (as a list of lists of words). The states are tuples of various lengths, and there are $\varepsilon$-transitions from longer tuples to shorter tuples. So, when you do the Viterbi algorithm, you'll need to sort the states from longest to shortest. (I apologize that this is somewhat obscure, but it makes the LM a lot more efficient.)
    • Use the output of your aligner in Part 3 -- or better, use train.align -- to extract multi-word expressions and their translations. (In the early 2000's, this, combined with the availability of more data, led to dramatic improvements in translation quality.)


Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: