CSE 40657/60657: Natural Language Processing

Clone the Homework 5 repository. It contains the following files:

train.zh-en	training data (Chinese-English)
train.zh	training data (Chinese side)
train.en	training data (English side)
test.zh	test data (Chinese side)
test.en	test data (English side)
backstroke.en	Backstroke of the West
bleu.py	evaluation script for translation
translate.py	IBM Model 1 decoder
lm.py	language model used by translate.py

The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3.

You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even the solution or another student's code as long as you cite properly.

1. These are your first steps

Write code to read in train.zh-en (1 point). It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
Write code to create the data structure(s) to store the model parameters $t(f \mid e)$ and initialize them all to uniform (3 points). Important: You only need to store $t(f \mid e)$ for a Chinese word $f$ and an English word $e$ (including $\text{NULL}$) that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $t(f \mid e)$. This will save you a lot of memory and time.
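One possible way to set up these two steps is sketched below in Python. The function names `read_parallel` and `init_uniform` are hypothetical, the nested dictionary `t[e][f]` is just one storage choice, and the sketch assumes that the Chinese side comes before the tab (as the file name train.zh-en suggests). It also takes "uniform" to mean $1/|V_f|$, where $V_f$ is the Chinese vocabulary; other uniform initializations would work too.

```python
from collections import defaultdict

NULL = "NULL"  # fake English word at position 0

def read_parallel(lines):
    """Parse 'Chinese<TAB>English' lines into (f_words, e_words) pairs.

    Assumes the Chinese side comes first; NULL is prepended to the English side.
    """
    pairs = []
    for line in lines:
        zh, en = line.rstrip("\n").split("\t", 1)
        f_words = zh.split()
        e_words = [NULL] + en.split()
        pairs.append((f_words, e_words))
    return pairs

def init_uniform(pairs):
    """Create t[e][f] only for (f, e) that co-occur in some line."""
    f_vocab = {f for f_words, _ in pairs for f in f_words}
    uniform = 1.0 / len(f_vocab)  # assumption: uniform over the Chinese vocabulary
    t = defaultdict(dict)
    for f_words, e_words in pairs:
        for e in e_words:
            for f in f_words:
                t[e][f] = uniform
    return t
```

With a real file, `read_parallel(open("train.zh-en", encoding="utf-8"))` would produce the sentence pairs. Entries for pairs that never co-occur are simply absent from the dictionary.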

2. Join me, and I will complete your training

Write code to perform the E step (3 points). As described in the notes, for each sentence pair and for each Chinese position $j=1, \ldots, m$ and English position $i=0, \ldots, \ell$, do: $$c(f_j,e_i) \leftarrow c(f_j,e_i) + \frac{t(f_j \mid e_i)}{\sum_{i'=0}^\ell t(f_j \mid e_{i'})}$$
Write code to perform the M step (3 points). As described in the notes, for each Chinese word type $f$ and English word type $e$ (including NULL), do: $$t(f \mid e) \leftarrow \frac{c(f,e)}{\sum_{f'} c(f',e)}$$
Train the model on the training data. Report the total (natural) log-likelihood of the data after each pass through the data (1 point). $$ \begin{align*} \text{log-likelihood} &= \sum_{\text{$(\mathbf{f}, \mathbf{e})$ in data}} \log P(\mathbf{f} \mid \mathbf{e}) \\ P(\mathbf{f} \mid \mathbf{e}) &= \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(\sum_{i=0}^\ell t(f_j \mid e_i)\right). \end{align*}$$ It should increase on every pass and eventually get better than -152000 (3 points).
After training, for each English word $e$ in {jedi, force, droid, sith, lightsabre}, report the five Chinese words $f$ with the highest $t(f \mid e)$, together with the value of $t(f \mid e)$ for each (1 point). The top translations should be: 绝地, 原力, 机器人, 西斯, and 光剑 (5 points).
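The E step, M step, and log-likelihood above could be combined into a single pass over the data, roughly as in the following sketch. It assumes that `pairs` is a list of `(f_words, e_words)` tuples in which `e_words` already starts with NULL, and that `t` is a nested dictionary indexed as `t[e][f]`; both conventions and the names `em_iteration` and `top_translations` are hypothetical. Note that the log-likelihood returned by a pass is computed with the parameters in effect before that pass's M step.

```python
import math
from collections import defaultdict

def em_iteration(pairs, t):
    """One EM pass; returns (new_t, loglik under the incoming t)."""
    counts = defaultdict(lambda: defaultdict(float))  # counts[e][f] = c(f, e)
    loglik = 0.0
    for f_words, e_words in pairs:
        loglik += math.log(1.0 / 100)        # the constant 1/100 factor
        for f in f_words:
            # z = sum over English positions i' = 0..l of t(f_j | e_i')
            z = sum(t[e][f] for e in e_words)
            loglik += math.log(z / len(e_words))  # len(e_words) = l + 1
            # E step: add the fractional count for every English position
            for e in e_words:
                counts[e][f] += t[e][f] / z
    # M step: renormalize the counts for each English word type
    new_t = {}
    for e, row in counts.items():
        total = sum(row.values())
        new_t[e] = {f: c / total for f, c in row.items()}
    return new_t, loglik

def top_translations(t, e, k=5):
    """Return the k (f, t(f|e)) pairs with the highest probability."""
    return sorted(t[e].items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Calling `em_iteration` repeatedly and printing `loglik` each time gives the per-pass numbers to report; after the last pass, `top_translations(t, "jedi")` would list the five best candidates for one of the query words.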

3. Now witness the power of this fully operational translation system

In this part, you'll use the provided Model 1 decoder to try to translate Episode 3 better than Backstroke of the West.

Write code to dump the word-translation probabilities in the following format (1 point):

captain 船长 0.7648140965086824
captain 看 2.1415989262858237e-11
captain 我们 2.5532871116116466e-09
captain 搜查 2.1644056596004747e-09
captain 了 1.1592984773140098e-07
NULL 船长 3.7303483172036556e-20
NULL 是 0.04330894875722255
NULL 长官 7.859239306612376e-14
NULL ? 3.5071699071940116e-05
NULL 通知 3.890691080886856e-19

The order of the lines does not matter.
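A minimal sketch of such a dump, assuming the parameters are held in a nested dictionary indexed as `t[e][f]` (a hypothetical layout; adapt it to your own data structure). Each output line holds the English word, the Chinese word, and the probability, separated by single spaces, as in the sample above.

```python
import sys

def dump_ttable(t, out=sys.stdout):
    """Write one 'english chinese probability' line per stored t(f|e)."""
    for e, row in t.items():
        for f, p in row.items():
            # repr() keeps full float precision, e.g. 2.5e-09
            out.write(f"{e} {f} {p!r}\n")
```

To write to a file instead of stdout, pass an open file object, for example `dump_ttable(t, open("your_ttable", "w", encoding="utf-8"))`.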

Translate Episode 3 (test.zh). The decoder should be run like this:
```
translate.py your_ttable train.en test.zh
```
Translations are written to stdout. In your report, show the output on lines 475–492 (1 point). Comment on the problems you see and how they might be fixed (3 points).
To evaluate translation accuracy using the BLEU metric, run:
```
bleu.py your_translations test.en
```
The score is between 0 and 1 (higher is better). What BLEU score does Backstroke of the West get (1 point)? What does your system get (1 point)? You must do better than Backstroke of the West (3 points).

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:

A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
All of the code that you wrote.
A README file with instructions on how to build and run your code.
