
CSE 40657/60657
Homework 5

Due
2017/12/06 11:55pm
Points
30

In 2005, a blog post went viral that showed a bootleg copy of Revenge of the Sith with its Chinese version translated (apparently by low-quality machine translation) back into English. Can you do better?

Clone the Homework 5 repository. It contains the following files:

train.zh-en    training data
train.align    alignments for training data
test.zh        test data, input sentences
test.en        test data, reference translations
align-f1.py    evaluation script for alignment
bleu.py        evaluation script for translation
fst.py         module for finite transducers (updated)
lm.py          module for language models

The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3. The alignments in train.align are not perfect; they are computed by a higher-quality alignment model (Model 4), but you'll treat them as a "silver" standard to evaluate your Model 1 alignments.

You may write code in any language you choose. You may reuse any code from previous homework assignments, including the posted solutions or another student's code, as long as you cite it properly.

1. Word alignment

In this part, you'll implement IBM Model 1.

  1. The model awakens
    • Write code to read in train.zh-en. It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
    • Write code to create the data structure(s) to store the model parameters $t(f\mid e)$ and initialize them all to uniform.2 Important: You only need a $t(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $t(f \mid e)$. This will save you a lot of memory and time.
  2. Train the model you must
    • Write code to perform the E step.3 As described in the notes, for each sentence pair and for each Chinese position $j=1, \ldots, m$ and English position $i=0, \ldots, \ell$, do: $$c(f_j,e_i) \leftarrow c(f_j,e_i) + \frac{t(f_j \mid e_i)}{\sum_{i'=0}^\ell t(f_j \mid e_{i'})}$$
    • Write code to perform the M step.3 As described in the notes, for each Chinese word type $f$ and English word type $e$ (including NULL), do: $$t(f \mid e) \leftarrow \frac{c(f,e)}{\sum_{f'} c(f',e)}$$
    • Train the model on the training data. Report the total (natural) log-likelihood of the data after each pass through the data.1 $$ \begin{align*} \text{log-likelihood} &= \sum_{\text{$(\mathbf{f}, \mathbf{e})$ in data}} \log P(\mathbf{f} \mid \mathbf{e}) \\ P(\mathbf{f} \mid \mathbf{e}) &= \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(\sum_{i=0}^\ell t(f_j \mid e_i)\right). \end{align*}$$ It should increase every time and eventually get better than -155000.3
    • After training, report $t(f \mid e)$ for the following (correct) word pairs:1
      English $e$ Chinese $f$
      jedi 绝地
      droid 机械人
      force 原力
      midi-chlorians 原虫
      yousa
  3. Test the model. Or test not. There is no try
    • Write code that, for each sentence pair of the training data, finds the best alignment according to the model, which is, for each $j$: $$\DeclareMathOperator*{\argmax}{arg~max} a_j = \argmax_{0 \leq i \leq \ell} t(f_j \mid e_i).$$ Print the (non-NULL) alignments in the same format as train.align.1
    • For example:
      0-1 2-0
      means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0. (Note that in this format, positions are zero-based, unlike in the math notation elsewhere in this document.) Report your alignments for the first five lines of train.zh-en.1
    • Evaluate your alignments against train.align using the command
      python align-f1.py your_alignments train.align
      Report your F1 score. It should be at least 50%.3
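The steps above (reading the bitext, uniform initialization over co-occurring pairs, the E and M steps, and the argmax alignment) can be sketched as follows. This is one possible implementation, not the official solution; the helper names `read_bitext`, `train_model1`, and `best_alignment` are made up for illustration.

```python
from collections import defaultdict
import math

def read_bitext(lines):
    """Parse tab-separated Chinese/English pairs; prepend NULL to each English side."""
    data = []
    for line in lines:
        zh, en = line.rstrip("\n").split("\t")
        data.append((zh.split(), ["NULL"] + en.split()))
    return data

def train_model1(data, iters=10):
    """EM for IBM Model 1. t[e][f] exists only for co-occurring word pairs."""
    # Initialize t(f|e) uniformly over Chinese words co-occurring with e.
    cooc = defaultdict(set)
    for fs, es in data:
        for e in es:
            cooc[e].update(fs)
    t = {e: {f: 1.0 / len(fset) for f in fset} for e, fset in cooc.items()}
    for _ in range(iters):
        count = defaultdict(lambda: defaultdict(float))
        ll = 0.0
        for fs, es in data:
            ll -= math.log(100)                       # constant factor in P(f|e)
            for f in fs:
                z = sum(t[e][f] for e in es)          # E-step normalizer
                ll += math.log(z) - math.log(len(es)) # len(es) = l + 1 (with NULL)
                for e in es:
                    count[e][f] += t[e][f] / z        # expected counts
        for e, cf in count.items():                   # M step: renormalize
            total = sum(cf.values())
            t[e] = {f: c / total for f, c in cf.items()}
    return t

def best_alignment(fs, es, t):
    """a_j = argmax_i t(f_j | e_i); es includes NULL at position 0."""
    return [max(range(len(es)), key=lambda i: t[es[i]].get(fs[j], 0.0))
            for j in range(len(fs))]
```

Note that `ll` here is the log-likelihood under the parameters in effect at the start of the pass; remember to subtract 1 from each position of `best_alignment`'s output (and drop NULL alignments) before printing in the zero-based `train.align` format.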

2. Monotone decoding

True Model 1 decoding should search through all possible reorderings of the input sentence; since this is somewhat involved, we'll try something simpler, which is to translate Chinese sentences word-for-word without reordering.
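To see what "word-for-word without reordering" means in the simplest possible terms, here is a toy greedy translator that ignores the language model entirely and just picks each Chinese word's single best English translation under a learned table `t` (keyed as `t[e][f]`, a layout assumed for illustration). The FST pipeline described next improves on this by letting the language model choose among the top candidates.

```python
def word_for_word(fs, t):
    """Greedy baseline: replace each Chinese word with its most probable
    English translation under t(f|e); a best translation of NULL deletes it."""
    out = []
    for f in fs:
        best_e, best_p = None, 0.0
        for e, fd in t.items():
            p = fd.get(f, 0.0)
            if p > best_p:
                best_e, best_p = e, p
        if best_e is not None and best_e != "NULL":
            out.append(best_e)
    return " ".join(out)
```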

  1. These were your father's transducers. Elegant weapons for a more civilized age.
    • Use the provided function lm.make_kneserney() to build a FST for a smoothed n-gram language model.1 Call this $M_{\text{LM}}$. Start with 2-grams; you can try higher-order models later. The states of $M_{\text{LM}}$ are tuples of various lengths, and there are $\varepsilon$-transitions from longer tuples to shorter tuples. So, later on when you do the Viterbi algorithm, you'll need to sort the states from longest to shortest. (I apologize that this is somewhat obscure, but it makes the LM a lot more efficient.)
    • Write code to build a FST for a translation model using the probabilities learned in Part 1.3 Call this $M_{\text{TM}}$. There should be just two states (a start state q0 and an accept state q1), and transitions:
      • A transition from q0 to q1 on symbol $\texttt{</s>:</s>}$
      • For each observed Chinese word type $f$, for each of its top 10 English translations $e$, a transition from q0 to q0 on symbol $f:e$ with probability $t(f \mid e)$. If $e = \text{NULL}$, make the transition output $\varepsilon$ (empty string).
      • We will handle unknown words simply by deleting them. For each Chinese word type $f$ in the test data, add a transition from q0 to q0 on symbol $f:\varepsilon$ with very small probability (like $10^{-100}$).
    • Write code that, given a Chinese sentence $\mathbf{f}$, does the following:3
      • Build a FST that maps $\mathbf{f}$ to itself. Call this $M_{\mathbf{f}}$.
      • Compose $M_{\mathbf{f}}$ with $M_{\text{TM}}$.
      • Compose that with $M_{\text{LM}}$.
      • Use the Viterbi algorithm to find the best translation. The states of the composed FST are of the form $((q_{\mathbf{f}}, q_{\text{TM}}), q_{\text{LM}})$. So the Viterbi algorithm should sort the states by $q_{\mathbf{f}}$, and in the case of a tie, by $-|q_{\text{LM}}|$ (that is, from longest to shortest LM state).
  2. Now witness the power of this fully operational translation system
    • Translate the first ten sentences of Episode 3 (test.zh). In your report, show the generated translations.1
    • The translations unfortunately aren't going to be very good. In your report, comment on what problems you see and how they might be fixed.3
    • Translate the whole movie (this took us under 18 minutes; under 7 minutes using PyPy) and evaluate the accuracy of your translations using the BLEU metric:
      python bleu.py your_translations test.en
      Report your score, which is between 0 and 1 (higher is better). Using the settings above, you should be able to get at least 0.01.3 To get the final point, try adjusting settings (reporting what you tried) to get the score up to 0.02.1
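The two-state translation FST described above can be sketched as below. This does not use the provided fst.py interface (whose API may differ); it just represents the machine as a list of `(state, input, output, next_state, probability)` tuples, with `""` standing for ε, and the probability 1.0 on the `</s>:</s>` transition is an assumption.

```python
def make_tm(t, test_vocab, k=10, unk_prob=1e-100):
    """Build the two-state translation FST (q0 start, q1 accept) as a list of
    transitions (state, input_symbol, output_symbol, next_state, probability)."""
    trans = [("q0", "</s>", "</s>", "q1", 1.0)]
    # Invert t[e][f] so we can rank, for each Chinese word f, its translations.
    by_f = {}
    for e, fd in t.items():
        for f, p in fd.items():
            by_f.setdefault(f, []).append((p, e))
    # Keep each Chinese word's top-k English translations; NULL outputs epsilon.
    for f, pairs in by_f.items():
        for p, e in sorted(pairs, reverse=True)[:k]:
            out = "" if e == "NULL" else e
            trans.append(("q0", f, out, "q0", p))
    # Unknown test-set words are deleted with a tiny probability.
    for f in test_vocab:
        if f not in by_f:
            trans.append(("q0", f, "", "q0", unk_prob))
    return trans
```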

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: