CSE 40657/60657: Natural Language Processing

Clone the Homework 5 repository (note: repository updated 2018/11/27 10am). It contains the following files:

train.zh-en	training data
train.align	alignments for training data
test.zh	test data, input sentences
test.en	test data, reference translations
align-f1.py	evaluation script for alignment
bleu.py	evaluation script for translation

The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3. The alignments in train.align are not perfect; they are computed by a higher-quality alignment model (Model 4) on a much larger data set, but you'll treat them as a "silver" standard to evaluate your Model 1 alignments.

You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even the solution or another student's code as long as you cite properly.

1. These are your first steps

Write code to read in train.zh-en.1 It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
Write code to create the data structure(s) to store the model parameters $t(f\mid e)$ and initialize them all to uniform.2 Important: You only need a $t(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $t(f \mid e)$. This will save you a lot of memory and time.

2. Join me, and I will complete your training

Write code to perform the E step.3 As described in the notes, for each sentence pair and for each Chinese position $j=1, \ldots, m$ and English position $i=0, \ldots, \ell$, do: $$c(f_j,e_i) \leftarrow c(f_j,e_i) + \frac{t(f_j \mid e_i)}{\sum_{i'=0}^\ell t(f_j \mid e_{i'})}$$
Write code to perform the M step.3 As described in the notes, for each Chinese word type $f$ and English word type $e$ (including NULL), do: $$t(f \mid e) \leftarrow \frac{c(f,e)}{\sum_{f'} c(f',e)}$$
Train the model on the training data. Report the total (natural) log-likelihood of the data after each pass through the data.1 $$ \begin{align*} \text{log-likelihood} &= \sum_{\text{$(\mathbf{f}, \mathbf{e})$ in data}} \log P(\mathbf{f} \mid \mathbf{e}) \\ P(\mathbf{f} \mid \mathbf{e}) &= \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(\sum_{i=0}^\ell t(f_j \mid e_i)\right). \end{align*}$$ It should increase every time and eventually get better than -175000.3
After training, for each English word $e$ in {Jedi, Force, droid, Sith, lightsabre}, for each of the five Chinese words $f$ with the highest $t(f \mid e)$, report both $f$ and $t(f \mid e)$.1 The top translations should be: 绝地, 原力, 机器人, 西斯, and 光剑.5

3. Test the model. Or test not. There is no try

Write code that, for each sentence pair of the training data, finds the best alignment according to the model, which is, for each $j$: $$\DeclareMathOperator*{\argmax}{arg~max} a_j = \argmax_{0 \leq i \leq \ell} t(f_j \mid e_i).$$ Print the (non-NULL) alignments in the same format as train.align.3

0-1 2-0

train.zh-en

Evaluate your alignments against train.align using the command
```
python align-f1.py your_alignments train.align
```
Report your F1 score. It should be at least 50%.5

4. Now witness the power of this fully operational translation system (optional)

There doesn't seem to be enough time left this semester to require the next part, which is to try to build a translation system that translates Episode 3 better than Backstroke of the West. So this part is entirely optional, worth zero points. Consequently, I've also left it more open-ended than usual.

True Model 1 decoding should search through all possible reorderings of the input sentence. This is somewhat involved, so in this part we'll just try word-for-word translation, with no reordering.

Try a hidden Markov model, like in HW4. Write code that reads in Chinese sentences $\mathbf{f}$ and "labels" each Chinese word with an English word, so as to maximize $\left(\prod_j t(f_j \mid e_j)\right) \times p(\mathbf{e})$, where $t(f \mid e)$ is the translation probability distribution learned in Part 2, and $p(\mathbf{e})$ is an English language model.
Translate the first ten sentences of Episode 3 (test.zh). They unfortunately aren't going to be very good. In your report, show the translations and on what problems you see, and how they might be fixed.
Translate the whole movie and evaluate the accuracy of your translations using the BLEU metric:
```
bleu.py your_translations test.en
```
Report your score, which is between 0 and 1 (higher is better). Did you do better than Backstroke of the West, which gets a BLEU of 0.0315?
Some ideas for improvement:
- Model deletions, as in HW2, using the probability $t(f \mid \text{NULL})$ as the probability of deleting $f$.
- The HW5 repository has a module lm.py with code for training a Kneser-Ney smoothed language model: lm.kneserney(data, n) creates a n-gram language model (as a FST) from the data in data (as a list of lists of words). The states are tuples of various lengths, and there are $\varepsilon$-transitions from longer tuples to shorter tuples. So, when you do the Viterbi algorithm, you'll need to sort the states from longest to shortest. (I apologize that this is somewhat obscure, but it makes the LM a lot more efficient.)
- Use the output of your aligner in Part 3 -- or better, use train.align -- to extract multi-word expressions and their translations. (In the early 2000's, this, combined with the availability of more data, led to dramatic improvements in translation quality.)

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:

A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
All of the code that you wrote.
A README file with instructions on how to build and run your code.

CSE 40657/60657 Homework 5

1. These are your first steps

2. Join me, and I will complete your training

3. Test the model. Or test not. There is no try

4. Now witness the power of this fully operational translation system (optional)

Submission

CSE 40657/60657
Homework 5