Download the Homework 5 data from Sakai [tgz] [zip]. It contains the following files:

| File | Description |
|------|-------------|
| data/episode1.zh-en | training data |
| data/episode1.align | "true" alignments |
| data/episode3.zh-en | test data |
| scripts/align-f1.py | evaluation script |
The "true" alignments are not produced by hand; they're produced by a better word alignment model, but are by no means perfect.
You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even another student's code, as long as you cite them properly.
Option A: Word Alignment
Implement IBM Model 1, trained using stochastic gradient ascent (SGA) as described in the notes. (If you consult other sources, they will almost certainly use a different method, expectation-maximization. If you implement that instead, it's fine too.)
The model can be written as:
$$P(\mathbf{f} \mid \mathbf{e}) = \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(t(f_j \mid \text{NULL}) + \sum_{i=1}^\ell t(f_j \mid e_i)\right).$$
(Compare the algorithm on pages 60–61 in the notes, which computes this.) The $t(f \mid e)$, in turn, are defined in terms of unconstrained parameters $\lambda(f \mid e)$:
$$t(f \mid e) = \frac{\exp \lambda(f \mid e)}{\sum_{f'} \exp \lambda(f' \mid e)}. $$
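A minimal sketch of this softmax in Python, assuming the parameters are stored as a nested dict `lam[e][f]` (that layout is our choice, not prescribed above):

```python
import math

def t_table(lam):
    """Convert unconstrained parameters lam[e][f] into probabilities
    t(f | e) by a softmax over all f seen with e."""
    t = {}
    for e, row in lam.items():
        m = max(row.values())                      # subtract max for numerical stability
        exps = {f: math.exp(v - m) for f, v in row.items()}
        Z = sum(exps.values())
        t[e] = {f: x / Z for f, x in exps.items()}
    return t

# With all lambdas zero, t(. | e) is uniform over the f's seen with e:
lam = {"NULL": {"绝地": 0.0, "武士": 0.0}}
print(t_table(lam)["NULL"])   # both entries are 0.5
```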
- The model awakens
- Write code to read in episode1.zh-en.2 It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
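One way to read the file, assuming the Chinese side comes first on each line (as described above) and prepending the NULL token while loading:

```python
def read_parallel(path):
    """Read tab-separated sentence pairs, Chinese first, one per line.
    A fake NULL token is prepended to the English side (e_0 = NULL)."""
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            zh, en = line.rstrip("\n").split("\t")
            f_sent = zh.split()                  # already tokenized
            e_sent = ["NULL"] + en.split()
            pairs.append((f_sent, e_sent))
    return pairs
```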
- Write code to create the data structure(s) to store the model parameters $\lambda(f\mid e)$.2
Initialize them all to zero. Important: You only need a $\lambda(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $\lambda(f \mid e)$. This will save you a lot of memory and time.
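A sketch of the co-occurrence-only initialization, again assuming the `lam[e][f]` nested-dict layout:

```python
from collections import defaultdict

def init_lambdas(pairs):
    """Create lambda(f | e) = 0 only for (f, e) pairs that co-occur in
    some sentence pair; stored as lam[e][f]."""
    lam = defaultdict(dict)
    for f_sent, e_sent in pairs:   # e_sent already includes NULL
        for e in e_sent:
            for f in f_sent:
                lam[e][f] = 0.0
    return lam

pairs = [(["绝地"], ["NULL", "jedi"]), (["你"], ["NULL", "yousa"])]
lam = init_lambdas(pairs)
# "jedi" co-occurs with 绝地 but never with 你, so lam["jedi"] has one key
```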
- Write code to compute, for any sentence pair, $\log P(\mathbf{f} \mid \mathbf{e})$ according to the above formulas and the algorithm on pages 60–61.3
With the $\lambda(f \mid e)$ initialized to zero, report $\log P(\mathbf{f} \mid \mathbf{e})$ for the first five lines of episode1.zh-en.3
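The log-probability computation can be sketched directly from the model formula; this assumes `t[e]` maps each co-occurring `f` to $t(f \mid e)$ and that `e_sent[0]` is NULL:

```python
import math

def log_prob(f_sent, e_sent, t):
    """log P(f | e) for Model 1: the 1/100 length term, the 1/(l+1)
    alignment factor per Chinese word, and the sum over partners."""
    ell = len(e_sent) - 1                # number of real English words
    lp = math.log(1 / 100)
    for f in f_sent:
        s = sum(t[e].get(f, 0.0) for e in e_sent)   # includes NULL at i = 0
        lp += math.log(s / (ell + 1))
    return lp
```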
- Train the model you must
- Write code to maximize $\log P(\mathbf{f} \mid \mathbf{e})$ by SGA.3
Unfortunately, it's not convenient to use Autograd, so we've derived the gradient for you. Here is the pseudocode for SGA:
- for $t \leftarrow 1, \ldots, T$ do
- $\eta \leftarrow 1/t$ # this worked pretty well for us, but you can experiment
- $LL \leftarrow 0$
- randomly shuffle training data # you can experiment with this too
- for $\mathbf{f}, \mathbf{e}$ in training data do
- $LL \leftarrow LL + \log P(\mathbf{f} \mid \mathbf{e})$
- for $j \leftarrow 1, \ldots, m$ do
- $Z \leftarrow t(f_j \mid \text{NULL}) + \sum_{i'=1}^\ell t(f_j \mid e_{i'})$
- for $i \leftarrow 0, \ldots, \ell$ do
- $p \leftarrow t(f_j \mid e_i)/Z$ # probability that $f_j$'s partner is $e_i$
- $\lambda(f_j \mid e_i) \leftarrow \lambda(f_j \mid e_i) + \eta p$
- for all $f$ do
- $\lambda(f \mid e_i) \leftarrow \lambda(f \mid e_i) - \eta p \cdot t(f \mid e_i)$
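The pseudocode above can be transcribed into Python as follows. This sketch assumes the `lam[e][f]` nested-dict layout, and it recomputes $t(\cdot \mid e)$ once per sentence pair, one reasonable choice the pseudocode leaves open:

```python
import math
import random

def softmax_row(row):
    """t(. | e) from one row of lambdas via a stable softmax."""
    m = max(row.values())
    exps = {f: math.exp(v - m) for f, v in row.items()}
    Z = sum(exps.values())
    return {f: x / Z for f, x in exps.items()}

def sga_epoch(pairs, lam, eta):
    """One pass of SGA over the data; lam[e][f] is updated in place.
    Each e_sent must start with "NULL". Returns the pass log-likelihood."""
    LL = 0.0
    random.shuffle(pairs)
    for f_sent, e_sent in pairs:
        t = {e: softmax_row(lam[e]) for e in set(e_sent)}
        ell = len(e_sent) - 1                      # real English words
        LL += math.log(1 / 100)                    # length term of log P
        for f in f_sent:
            Z = sum(t[e][f] for e in e_sent)       # includes NULL
            LL += math.log(Z / (ell + 1))
            for e in e_sent:
                p = t[e][f] / Z                    # prob. f's partner is e
                lam[e][f] += eta * p
                for fp in lam[e]:                  # expected-count term
                    lam[e][fp] -= eta * p * t[e][fp]
    return LL
```

The inner `for fp in lam[e]` loop is why restricting $\lambda(f \mid e)$ to co-occurring pairs matters: each update touches only one row, not the whole Chinese vocabulary.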
- Train the model. Report the log-likelihood $\sum_{\mathbf{f}, \mathbf{e}} \log P(\mathbf{f} \mid \mathbf{e})$, after each pass through the data.1 It should increase (almost) every time.3
- After training, report $t(f \mid e)$ (not $\lambda(f \mid e)$) for the following (correct) word pairs:3
| English $e$ | Chinese $f$ |
|---|---|
| jedi | 绝地 |
| droid | 机械人 |
| force | 原力 |
| midi-chlorians | 原虫 |
| yousa | 你 |
- Test the model. Or test not. There is no try
- Write code that, for every sentence pair, finds the best alignment according to the model; that is, for each $j$:
$$\DeclareMathOperator*{\argmax}{arg~max} a_j = \argmax_{0 \leq i \leq \ell} t(f_j \mid e_i).$$
Print the (non-NULL) alignments in the same format as episode1.align.3 For example:
0-1 2-0
means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0.
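A sketch of the decoder, assuming `t[e]` maps `f` to $t(f \mid e)$ and `e_sent[0]` is NULL; note the English index must be shifted down by one when printing, since NULL occupies position 0 internally but the output format counts real English words from 0:

```python
def decode(f_sent, e_sent, t):
    """Best alignment under the model: a_j = argmax_i t(f_j | e_i).
    NULL links are dropped, and the English index is shifted by -1 so
    that the first real English word is word 0."""
    links = []
    for j, f in enumerate(f_sent):
        a_j = max(range(len(e_sent)), key=lambda i: t[e_sent[i]].get(f, 0.0))
        if a_j != 0:                        # a_j = 0 means aligned to NULL
            links.append(f"{j}-{a_j - 1}")
    return " ".join(links)
```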
Report your alignments for the first five lines of episode1.zh-en.3
- Evaluate your alignments against episode1.align using the command
align-f1.py your_alignments episode1.align
Report your F1 score. It should be at least 50%.4
Option B: Syntax-Based Translation
Coming soon!
Submission
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
- A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
- All of the code that you wrote.
- A README file with instructions on how to build and run your code.