CSE 40657/60657
Homework 5

Due: 2016/12/07 11:55pm
Points: 30

In this assignment, you have a choice between two translation-related tasks: unsupervised word alignment or syntax-based translation.

Download the Homework 5 data from Sakai [tgz] [zip]. It contains the following files:

data/episode1.zh-en    training data
data/episode1.align    "true" alignments
data/episode3.zh-en    test data
scripts/align-f1.py    evaluation script
The "true" alignments are not produced by hand; they're produced by a better word alignment model, but are by no means perfect.

You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even another student's code, as long as you cite them properly.

Option A: Word Alignment

Implement IBM Model 1, trained using stochastic gradient ascent (SGA) as described in the notes. (If you consult other sources, they will almost certainly use a different training method, expectation-maximization; if you implement that instead, that's fine too.)

The model can be written as: $$P(\mathbf{f} \mid \mathbf{e}) = \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(t(f_j \mid \text{NULL}) + \sum_{i=1}^\ell t(f_j \mid e_i)\right).$$ (Compare the algorithm on pages 60–61 in the notes, which computes this.) The $t(f \mid e)$, in turn, are defined in terms of unconstrained parameters $\lambda(f \mid e)$: $$t(f \mid e) = \frac{\exp \lambda(f \mid e)}{\sum_{f'} \exp \lambda(f' \mid e)}. $$
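To make the parameterization concrete, here is a minimal Python sketch of computing $t(\cdot \mid e)$ from $\lambda(\cdot \mid e)$. The storage scheme (a dict of dicts, lam[e][f]) and the name t_table are our own assumptions, not requirements of the assignment:

    import math

    def t_table(e, lam):
        """Compute t(. | e) as a softmax over lambda(. | e).
        lam[e][f] holds lambda(f | e), stored only for cooccurring pairs."""
        m = max(lam[e].values())  # subtract the max for numerical stability
        exps = {f: math.exp(v - m) for f, v in lam[e].items()}
        z = sum(exps.values())
        return {f: x / z for f, x in exps.items()}

With all $\lambda(f \mid e)$ initialized to zero, this returns the uniform distribution over the Chinese words that cooccur with $e$.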

  1. The model awakens
    • Write code to read in episode1.zh-en (2 points). It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
    • Write code to create the data structure(s) to store the model parameters $\lambda(f\mid e)$ (2 points).
    • Initialize them all to zero. Important: You only need a $\lambda(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $\lambda(f \mid e)$. This will save you a lot of memory and time.
    • Write code to compute, for any sentence pair, $\log P(\mathbf{f} \mid \mathbf{e})$ according to the above formulas and the algorithm on pages 60–61 (3 points); see the sketch after this list.
    • With the $\lambda(f \mid e)$ initialized to zero, report $\log P(\mathbf{f} \mid \mathbf{e})$ for the first five lines of episode1.zh-en (3 points).
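    One possible sketch of the reading, storage, and scoring code, reusing t_table from above; the helper names read_parallel, init_lambdas, and log_prob are hypothetical:

      import math
      from collections import defaultdict

      def read_parallel(path):
          """Yield (Chinese words, English words) pairs from a tab-separated file."""
          with open(path, encoding='utf-8') as fin:
              for line in fin:
                  zh, en = line.rstrip('\n').split('\t')
                  yield zh.split(), ['NULL'] + en.split()  # prepend e_0 = NULL

      def init_lambdas(data):
          """Create lambda(f | e) = 0 only for cooccurring (f, e) pairs."""
          lam = defaultdict(dict)
          for fwords, ewords in data:
              for e in ewords:
                  for f in fwords:
                      lam[e][f] = 0.
          return lam

      def log_prob(fwords, ewords, t):
          """log P(f | e) under Model 1, where t[e][f] = t(f | e)."""
          lp = math.log(1 / 100)
          for f in fwords:
              # sum over all English positions, including NULL at position 0;
              # len(ewords) = l + 1 because NULL was prepended
              lp += math.log(sum(t[e].get(f, 0.) for e in ewords))
              lp -= math.log(len(ewords))
          return lp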
  2. Train the model you must
    • Write code to maximize $\log P(\mathbf{f} \mid \mathbf{e})$ by SGA (3 points).
    • Unfortunately, it's not convenient to use Autograd here, so we've derived the gradient for you. Here is the pseudocode for SGA (a Python transcription appears at the end of this step):
      • for $t \leftarrow 1, \ldots, T$ do
        • $\eta \leftarrow 1/t$ # this worked pretty well for us, but you can experiment
        • $LL \leftarrow 0$
        • randomly shuffle training data # you can experiment with this too
        • for $\mathbf{f}, \mathbf{e}$ in training data do
          • $LL \leftarrow LL + \log P(\mathbf{f} \mid \mathbf{e})$
          • for $j \leftarrow 1, \ldots, m$ do
            • $Z \leftarrow t(f_j \mid \text{NULL}) + \sum_{i'=1}^\ell t(f_j \mid e_{i'})$
            • for $i \leftarrow 0, \ldots, \ell$ do
              • $p \leftarrow t(f_j \mid e_i)/Z$ # probability that $f_j$'s partner is $e_i$
              • $\lambda(f_j \mid e_i) \leftarrow \lambda(f_j \mid e_i) + \eta p$
              • for all $f$
                • $\lambda(f \mid e_i) \leftarrow \lambda(f \mid e_i) - \eta p \cdot t(f \mid e_i)$
    • Train the model. Report the log-likelihood $\sum_{\mathbf{f}, \mathbf{e}} \log P(\mathbf{f} \mid \mathbf{e})$ after each pass through the data (1 point). It should increase (almost) every time (3 points).
    • After training, report $t(f \mid e)$ (not $\lambda(f \mid e)$) for the following (correct) word pairs (3 points):
      English $e$ Chinese $f$
      jedi 绝地
      droid 机械人
      force 原力
      midi-chlorians 原虫
      yousa
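    Below is one way to transcribe the pseudocode into Python, reusing t_table and log_prob from the sketches above. One simplification we assume: $t(\cdot \mid e)$ is recomputed once per sentence pair rather than after every single update:

      import random

      def train(data, lam, T=10):
          """SGA on the log-likelihood; data is a list of (fwords, ewords) pairs."""
          for epoch in range(1, T + 1):
              eta = 1 / epoch                  # step size 1/t, as suggested above
              ll = 0.
              random.shuffle(data)
              for fwords, ewords in data:
                  t = {e: t_table(e, lam) for e in set(ewords)}
                  ll += log_prob(fwords, ewords, t)
                  for f in fwords:
                      Z = sum(t[e][f] for e in ewords)
                      for e in ewords:         # includes NULL at position 0
                          p = t[e][f] / Z      # probability that f's partner is e
                          lam[e][f] += eta * p
                          for f2 in lam[e]:    # "for all f" in the pseudocode
                              lam[e][f2] -= eta * p * t[e][f2]
              print(f'pass {epoch}: log-likelihood {ll:.2f}')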
  3. Test the model. Or test not. There is no try
    • Write code that, for every sentence pair, finds the best alignment according to the model, which is, for each $j$: $$\DeclareMathOperator*{\argmax}{arg~max} a_j = \argmax_{0 \leq i \leq \ell} t(f_j \mid e_i).$$ Print the (non-NULL) alignments in the same format as episode1.align (3 points); see the sketch after this step.
    • For example:
      0-1 2-0
      means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0. Report your alignments for the first five lines of episode1.zh-en (3 points).
    • Evaluate your alignments against episode1.align using the command
      align-f1.py your_alignments episode1.align
      Report your F1 score. It should be at least 50% (4 points).
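    A sketch of the decoder, again under the conventions assumed above (ewords[0] = NULL; output positions are 0-based with NULL removed); the name best_alignment is our own:

      def best_alignment(fwords, ewords, t):
          """For each Chinese word, pick the English position maximizing t(f | e)."""
          links = []
          for j, f in enumerate(fwords):
              a = max(range(len(ewords)), key=lambda i: t[ewords[i]].get(f, 0.))
              if a != 0:                        # alignments to NULL are not printed
                  links.append(f'{j}-{a - 1}')  # subtract 1 to undo the NULL offset
          return ' '.join(links)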

Option B: Syntax-Based Translation

Coming soon!

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: