CSE 40657/60657
Homework 5

2019/12/12 5pm

In 2005, a blog post went viral that showed a bootleg copy of Revenge of the Sith whose Chinese version had been translated (apparently by low-quality machine translation) back into English, producing a movie called Backstroke of the West. Can you do better?

Clone the Homework 5 repository. It contains the following files:

    train.zh-en     training data (Chinese-English)
    train.zh        training data (Chinese side)
    train.en        training data (English side)
    test.zh         test data (Chinese side)
    test.en         test data (English side)
    backstroke.en   Backstroke of the West
    bleu.py         evaluation script for translation
    translate.py    IBM Model 1 decoder
    lm.py           Language model used by

The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3.

You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even the solution or another student's code as long as you cite properly.

1. These are your first steps

  1. Write code to read in train.zh-en. (1 point) It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
  2. Write code to create the data structure(s) to store the model parameters $t(f\mid e)$ and initialize them all to uniform. (3 points) Important: You only need a $t(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $t(f \mid e)$. This will save you a lot of memory and time.
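
The two steps above can be sketched as follows. This is only one possible layout, not the official solution: the function names `read_bitext` and `init_ttable` are made up for illustration, and "uniform" is interpreted here as uniform over the Chinese words that co-occur with each English word (a single constant for every entry would serve equally well, since the first M step renormalizes).

```python
from collections import defaultdict

NULL = "NULL"  # fake English word prepended at position 0

def read_bitext(lines):
    """Parse 'Chinese<TAB>English' lines into (f_words, e_words) pairs."""
    pairs = []
    for line in lines:
        zh, en = line.rstrip("\n").split("\t", 1)
        # Both sides are already tokenized, so splitting on whitespace suffices.
        pairs.append((zh.split(), [NULL] + en.split()))
    return pairs

def init_ttable(pairs):
    """Build t[e][f] = t(f | e), only for (f, e) seen in the same line."""
    cooc = defaultdict(set)  # cooc[e] = Chinese words co-occurring with e
    for fs, es in pairs:
        for e in es:
            cooc[e].update(fs)
    # Assumption: uniform over each e's co-occurring Chinese words.
    return {e: {f: 1.0 / len(fset) for f in fset} for e, fset in cooc.items()}
```

A typical call would be `pairs = read_bitext(open("train.zh-en", encoding="utf-8"))` followed by `t = init_ttable(pairs)`. Storing the table as a dict of dicts keyed by the English word keeps it sparse and makes the per-$e$ normalization in the M step a single pass over one inner dict.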

2. Join me, and I will complete your training

  1. Write code to perform the E step. (3 points) As described in the notes, for each sentence pair and for each Chinese position $j=1, \ldots, m$ and English position $i=0, \ldots, \ell$, do: $$c(f_j,e_i) \leftarrow c(f_j,e_i) + \frac{t(f_j \mid e_i)}{\sum_{i'=0}^\ell t(f_j \mid e_{i'})}$$
  2. Write code to perform the M step. (3 points) As described in the notes, for each Chinese word type $f$ and English word type $e$ (including NULL), do: $$t(f \mid e) \leftarrow \frac{c(f,e)}{\sum_{f'} c(f',e)}$$
  3. Train the model on the training data. Report the total (natural) log-likelihood of the data after each pass through the data. (1 point) $$ \begin{align*} \text{log-likelihood} &= \sum_{\text{$(\mathbf{f}, \mathbf{e})$ in data}} \log P(\mathbf{f} \mid \mathbf{e}) \\ P(\mathbf{f} \mid \mathbf{e}) &= \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(\sum_{i=0}^\ell t(f_j \mid e_i)\right). \end{align*}$$ It should increase on every pass and eventually exceed $-152000$. (3 points)
  4. After training, for each English word $e$ in {jedi, force, droid, sith, lightsabre}, for each of the five Chinese words $f$ with the highest $t(f \mid e)$, report both $f$ and $t(f \mid e)$. (1 point) The top translations should be: 绝地, 原力, 机器人, 西斯, and 光剑. (5 points)
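
As a rough illustration of how the E step, M step, and log-likelihood formulas above fit together, here is a minimal sketch of a single EM pass. It assumes the table is a dict of dicts `t[e][f]` and that each English sentence already starts with NULL; the names `em_pass` and `top_translations` are hypothetical, and this is not necessarily how the notes or the reference solution organize the computation.

```python
import math
from collections import defaultdict

def em_pass(pairs, t):
    """Run one EM pass; return (new_t, log-likelihood under the old t)."""
    count = defaultdict(lambda: defaultdict(float))  # count[e][f] = c(f, e)
    loglik = 0.0
    for fs, es in pairs:                   # es[0] is assumed to be NULL
        loglik += math.log(1 / 100)        # the constant 1/100 factor
        for f in fs:
            z = sum(t[e][f] for e in es)   # sum over i' of t(f_j | e_i')
            loglik += math.log(z / len(es))    # len(es) equals l + 1
            for e in es:
                count[e][f] += t[e][f] / z     # E step: fractional count
    new_t = {}
    for e, fcounts in count.items():       # M step: normalize per e
        total = sum(fcounts.values())
        new_t[e] = {f: c / total for f, c in fcounts.items()}
    return new_t, loglik

def top_translations(t, e, k=5):
    """Return the k Chinese words with the highest t(f | e), best first."""
    return sorted(t[e].items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Because the denominator `z` is needed by both the E step and the log-likelihood, computing them in the same loop avoids a second pass over the data. The log-likelihood returned is that of the parameters going into the pass, so one extra call is needed to score the final parameters.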

3. Now witness the power of this fully operational translation system

In this part, you'll use the provided Model 1 decoder to try to translate Episode 3 better than Backstroke of the West.

  1. Write code to dump the word-translation probabilities in the following format (1 point):
    captain 船长 0.7648140965086824
    captain 看 2.1415989262858237e-11
    captain 我们 2.5532871116116466e-09
    captain 搜查 2.1644056596004747e-09
    captain 了 1.1592984773140098e-07
    NULL 船长 3.7303483172036556e-20
    NULL 是 0.04330894875722255
    NULL 长官 7.859239306612376e-14
    NULL ? 3.5071699071940116e-05
    NULL 通知 3.890691080886856e-19
    The order of the lines does not matter.
  2. Translate Episode 3 (test.zh). The decoder should be run like this: your_ttable train.en test.zh
    Translations are written to stdout. In your report, show the output on lines 475–492. (1 point) Comment on what problems you see, and how they might be fixed. (3 points)
  3. To evaluate translation accuracy using the BLEU metric, run: your_translations test.en
    The score is between 0 and 1 (higher is better). What BLEU score does Backstroke of the West get? (1 point) What does your system get? (1 point) You must do better than Backstroke of the West. (3 points)
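
For the dump in step 1 of this part, a sketch like the following may help. It assumes the same dict-of-dicts table `t[e][f]` as a starting point, and it assumes from the sample lines that fields are separated by single spaces in the order English word, Chinese word, probability; the helper names `ttable_lines` and `dump_ttable` are invented for this example, and the decoder's actual parsing rules should be checked in translate.py.

```python
def ttable_lines(t):
    """Yield one 'e f probability' line per stored t(f | e) entry."""
    for e, fprobs in t.items():
        for f, p in fprobs.items():
            # repr() gives full float precision, e.g. 2.1415989262858237e-11
            yield f"{e} {f} {p!r}"

def dump_ttable(t, path):
    """Write the table to a UTF-8 text file, one entry per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in ttable_lines(t):
            out.write(line + "\n")
```

Writing the file explicitly as UTF-8 avoids platform-dependent default encodings, which matters here because the Chinese side is non-ASCII.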


Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: