CSE 40657/60657
Homework 5

Due: 2016/12/07 11:55pm
Points: 30

In this assignment, you have a choice between two translation-related tasks: unsupervised word alignment or syntax-based translation.

Download the Homework 5 data from Sakai [tgz] [zip]. It contains the following files:

data/episode1.zh-en    training data
data/episode1.align    "true" alignments
data/episode3.zh-en    test data
scripts/align-f1.py    evaluation script
The "true" alignments are not produced by hand; they're produced by a better word alignment model, but are by no means perfect.

You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even another student's code, as long as you cite them properly.

Option A: Word Alignment

Implement IBM Model 1, trained using stochastic gradient ascent (SGA) as described in the notes. (If you consult other sources, they will almost certainly use a different training method, expectation-maximization; if you implement that instead, that's fine too.)

The model can be written as: $$P(\mathbf{f} \mid \mathbf{e}) = \frac1{100} \times \prod_{j=1}^m \frac1{\ell+1} \left(t(f_j \mid \text{NULL}) + \sum_{i=1}^\ell t(f_j \mid e_i)\right).$$ (Compare the algorithm on pages 60–61 in the notes, which computes this.) The $t(f \mid e)$, in turn, are defined in terms of unconstrained parameters $\lambda(f \mid e)$: $$t(f \mid e) = \frac{\exp \lambda(f \mid e)}{\sum_{f'} \exp \lambda(f' \mid e)}. $$
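To make the parameterization concrete, here is a minimal Python sketch of computing $t(\cdot \mid e)$ from $\lambda(\cdot \mid e)$. The storage scheme (a dict of dicts, lam[e][f]) and the name t_table are our own assumptions, not requirements of the assignment:

    import math

    def t_table(e, lam):
        """Compute t(. | e) as a softmax over lambda(. | e).
        lam[e][f] holds lambda(f | e), stored only for cooccurring pairs."""
        m = max(lam[e].values())  # subtract the max for numerical stability
        exps = {f: math.exp(v - m) for f, v in lam[e].items()}
        z = sum(exps.values())
        return {f: x / z for f, x in exps.items()}

With all $\lambda(f \mid e)$ initialized to zero, this returns the uniform distribution over the Chinese words that cooccur with $e$.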

  1. The model awakens
    • Write code to read in episode1.zh-en (2 points). It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.
    • Write code to create the data structure(s) to store the model parameters $\lambda(f\mid e)$ (2 points).
    • Initialize them all to zero. Important: You only need a $\lambda(f \mid e)$ for every Chinese word $f$ and English word $e$ that occur in the same line. If $f$ and $e$ never occur in the same line, don't create $\lambda(f \mid e)$. This will save you a lot of memory and time.
    • Write code to compute, for any sentence pair, $\log P(\mathbf{f} \mid \mathbf{e})$ according to the above formulas and the algorithm on pages 60–61 (3 points); see the sketch after this list.
    • With the $\lambda(f \mid e)$ initialized to zero, report $\log P(\mathbf{f} \mid \mathbf{e})$ for the first five lines of episode1.zh-en (3 points).
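    One possible sketch of the reading, storage, and scoring code, reusing t_table from above; the helper names read_parallel, init_lambdas, and log_prob are hypothetical:

      import math
      from collections import defaultdict

      def read_parallel(path):
          """Yield (Chinese words, English words) pairs from a tab-separated file."""
          with open(path, encoding='utf-8') as fin:
              for line in fin:
                  zh, en = line.rstrip('\n').split('\t')
                  yield zh.split(), ['NULL'] + en.split()  # prepend e_0 = NULL

      def init_lambdas(data):
          """Create lambda(f | e) = 0 only for cooccurring (f, e) pairs."""
          lam = defaultdict(dict)
          for fwords, ewords in data:
              for e in ewords:
                  for f in fwords:
                      lam[e][f] = 0.
          return lam

      def log_prob(fwords, ewords, t):
          """log P(f | e) under Model 1, where t[e][f] = t(f | e)."""
          lp = math.log(1 / 100)
          for f in fwords:
              # sum over all English positions, including NULL at position 0;
              # len(ewords) = l + 1 because NULL was prepended
              lp += math.log(sum(t[e].get(f, 0.) for e in ewords))
              lp -= math.log(len(ewords))
          return lp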
  2. Train the model you must
    • Write code to maximize $\log P(\mathbf{f} \mid \mathbf{e})$ by SGA (3 points).
    • Unfortunately, it's not convenient to use Autograd here, so we've derived the gradient for you. Here is the pseudocode for SGA (a Python transcription appears at the end of this step):
      • for $t \leftarrow 1, \ldots, T$ do
        • $\eta \leftarrow 1/t$ # this worked pretty well for us, but you can experiment
        • $LL \leftarrow 0$
        • randomly shuffle training data # you can experiment with this too
        • for $\mathbf{f}, \mathbf{e}$ in training data do
          • $LL \leftarrow LL + \log P(\mathbf{f} \mid \mathbf{e})$
          • for $j \leftarrow 1, \ldots, m$ do
            • $Z \leftarrow t(f_j \mid \text{NULL}) + \sum_{i'=1}^\ell t(f_j \mid e_{i'})$
            • for $i \leftarrow 0, \ldots, \ell$ do
              • $p \leftarrow t(f_j \mid e_i)/Z$ # probability that $f_j$'s partner is $e_i$
              • $\lambda(f_j \mid e_i) \leftarrow \lambda(f_j \mid e_i) + \eta p$
              • for all $f$
                • $\lambda(f \mid e_i) \leftarrow \lambda(f \mid e_i) - \eta p \cdot t(f \mid e_i)$
    • Train the model. Report the log-likelihood $\sum_{\mathbf{f}, \mathbf{e}} \log P(\mathbf{f} \mid \mathbf{e})$ after each pass through the data (1 point). It should increase (almost) every time (3 points).
    • After training, report $t(f \mid e)$ (not $\lambda(f \mid e)$) for the following (correct) word pairs (3 points):
      English $e$ Chinese $f$
      jedi 绝地
      droid 机械人
      force 原力
      midi-chlorians 原虫
      yousa
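    Below is one way to transcribe the pseudocode into Python, reusing t_table and log_prob from the sketches above. One simplification we assume: $t(\cdot \mid e)$ is recomputed once per sentence pair rather than after every single update:

      import random

      def train(data, lam, T=10):
          """SGA on the log-likelihood; data is a list of (fwords, ewords) pairs."""
          for epoch in range(1, T + 1):
              eta = 1 / epoch                  # step size 1/t, as suggested above
              ll = 0.
              random.shuffle(data)
              for fwords, ewords in data:
                  t = {e: t_table(e, lam) for e in set(ewords)}
                  ll += log_prob(fwords, ewords, t)
                  for f in fwords:
                      Z = sum(t[e][f] for e in ewords)
                      for e in ewords:         # includes NULL at position 0
                          p = t[e][f] / Z      # probability that f's partner is e
                          lam[e][f] += eta * p
                          for f2 in lam[e]:    # "for all f" in the pseudocode
                              lam[e][f2] -= eta * p * t[e][f2]
              print(f'pass {epoch}: log-likelihood {ll:.2f}')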
  3. Test the model. Or test not. There is no try
    • Write code that, for every sentence pair, finds the best alignment according to the model, which is, for each $j$: $$\DeclareMathOperator*{\argmax}{arg~max} a_j = \argmax_{0 \leq i \leq \ell} t(f_j \mid e_i).$$ Print the (non-NULL) alignments in the same format as episode1.align (3 points); see the sketch after this step.
    • For example:
      0-1 2-0
      means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0. Report your alignments for the first five lines of episode1.zh-en (3 points).
    • Evaluate your alignments against episode1.align using the command
      align-f1.py your_alignments episode1.align
      Report your F1 score. It should be at least 50% (4 points).
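    A sketch of the decoder, again under the conventions assumed above (ewords[0] = NULL; output positions are 0-based with NULL removed); the name best_alignment is our own:

      def best_alignment(fwords, ewords, t):
          """For each Chinese word, pick the English position maximizing t(f | e)."""
          links = []
          for j, f in enumerate(fwords):
              a = max(range(len(ewords)), key=lambda i: t[ewords[i]].get(f, 0.))
              if a != 0:                        # alignments to NULL are not printed
                  links.append(f'{j}-{a - 1}')  # subtract 1 to undo the NULL offset
          return ' '.join(links)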

Option B: Syntax-Based Translation

Coming soon!

Submission

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: