In 2005, a blog post went viral that showed a bootleg copy of Revenge of the Sith with its Chinese version translated (apparently by low-quality machine translation) back into English. Can you do better?
Clone the Homework 5 repository. It contains the following files:
File | Description |
---|---|
train.zh-en | training data |
train.align | alignments for training data |
test.zh | test data, input sentences |
test.en | test data, reference translations |
align-f1.py | evaluation script for alignment |
bleu.py | evaluation script for translation |
fst.py | module for finite-state transducers (updated) |
lm.py | module for language models |
The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3. The alignments in train.align
are not perfect: they were computed automatically by a higher-quality alignment model (Model 4). You'll treat them as a "silver" standard against which to evaluate your Model 1 alignments.
You may write code in any language you choose. You may reuse any code from previous homework assignments, or even the solutions or another student's code, as long as you cite properly.
In this part, you'll implement IBM Model 1.
The training data is in train.zh-en. It contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$. (One way to read this file and train Model 1 on it is sketched just after the table below.) Here is some Star Wars vocabulary that appears in the data:

English $e$ | Chinese $f$ |
---|---|
jedi | 绝地 |
droid | 机械人 |
force | 原力 |
midi-chlorians | 原虫 |
yousa | 你 |
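
As noted above, here is a minimal sketch of reading train.zh-en and training IBM Model 1 by expectation-maximization. It is only one possible starting point, not the required implementation; the names read_parallel and train_model1 and the choice of ten EM iterations are illustrative assumptions.

```python
from collections import defaultdict

def read_parallel(path):
    """Yield (chinese, english) word lists, one pair per line, with the
    fake word NULL prepended to the English side so that e[0] == 'NULL'."""
    with open(path, encoding='utf-8') as file:
        for line in file:
            zh, en = line.rstrip('\n').split('\t')
            yield zh.split(), ['NULL'] + en.split()

def train_model1(pairs, iterations=10):
    """Estimate the Model 1 translation table t(f|e) by EM."""
    pairs = list(pairs)
    t = defaultdict(lambda: 1.0)              # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c(e, f)
        total = defaultdict(float)            # expected counts c(e)
        for zh, en in pairs:                  # E step
            for f_word in zh:
                z = sum(t[e_word, f_word] for e_word in en)
                for e_word in en:
                    p = t[e_word, f_word] / z
                    count[e_word, f_word] += p
                    total[e_word] += p
        t = defaultdict(float,                # M step: renormalize
                        {ef: c / total[ef[0]] for ef, c in count.items()})
    return t

t = train_model1(read_parallel('train.zh-en'))
```

If training behaves, word pairs like those in the table above (e.g., $t(\text{绝地} \mid \text{jedi})$) should end up with high probability.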
The file train.align gives the alignments for each training sentence pair, one line per pair. For example, the line

0-1 2-0

means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0. (Note that in this format, positions are zero-based, unlike in the math notation elsewhere in this document.)
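
Given a translation table t, one simple way to produce alignments in this format is to link each Chinese word to the English position (including NULL) that maximizes t(f|e). This is a sketch assuming the t table and NULL convention from the sketch above; words aligned to NULL are simply omitted, matching the format.

```python
def align_line(zh, en, t):
    """Return one line in the train.align format: link each Chinese word j
    to the English word i maximizing t(f_j|e_i). Positions are zero-based,
    and en[0] is NULL, so real English words are shifted down by one."""
    links = []
    for j, f_word in enumerate(zh):
        i = max(range(len(en)), key=lambda i: t[en[i], f_word])
        if i > 0:                       # drop links to NULL (position 0)
            links.append(f'{j}-{i - 1}')
    return ' '.join(links)

# For example, to print alignments for the first five training pairs:
for zh, en in list(read_parallel('train.zh-en'))[:5]:
    print(align_line(zh, en, t))
```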
Report your alignments for the first five lines of train.zh-en. (1 point) Then compare your alignments against the "silver" alignments in train.align using the command

python align-f1.py your_alignments train.align

Report your F1 score. It should be at least 50%. (3 points)

True Model 1 decoding should search through all possible reorderings of the input sentence; since this is somewhat involved, we'll try something simpler, which is to translate Chinese sentences word-for-word without reordering.
Use lm.make_kneserney() to build an FST for a smoothed n-gram language model. (1 point) Call this $M_{\text{LM}}$. Start with 2-grams; you can try higher-order models later. The states of $M_{\text{LM}}$ are tuples of various lengths, and there are $\varepsilon$-transitions from longer tuples to shorter tuples. So, later on when you do the Viterbi algorithm, you'll need to sort the states from longest to shortest. (I apologize that this is somewhat obscure, but it makes the LM a lot more efficient.)
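
Your actual decoder should be built from the provided fst.py and lm.py modules, whose interfaces aren't reproduced in this handout. Purely to illustrate word-for-word Viterbi decoding, here is a self-contained sketch that substitutes a simple additively smoothed bigram model for $M_{\text{LM}}$; every name here (best_translations, make_bigram_lm, decode) and the candidate cutoff k=5 are illustrative assumptions, not part of the assignment.

```python
import math
from collections import defaultdict

def best_translations(t, k=5):
    """Invert the Model 1 table: for each Chinese word f, keep the k
    English words e with the highest t(f|e)."""
    by_f = defaultdict(list)
    for (e, f), p in t.items():
        by_f[f].append((p, e))
    return {f: sorted(ps, reverse=True)[:k] for f, ps in by_f.items()}

def make_bigram_lm(sentences, alpha=0.1):
    """Stand-in for the real Kneser-Ney LM: an additively smoothed
    bigram model, just so this sketch runs end to end."""
    unigram, bigram, vocab = defaultdict(float), defaultdict(float), set()
    for words in sentences:
        prev = '<s>'
        for w in words:
            unigram[prev] += 1
            bigram[prev, w] += 1
            vocab.add(w)
            prev = w
    def lm_logprob(prev, word):
        return math.log((bigram[prev, word] + alpha) /
                        (unigram[prev] + alpha * len(vocab)))
    return lm_logprob

def decode(zh, cands, lm_logprob):
    """Viterbi search over one English candidate per Chinese position,
    maximizing the sum of log t(f|e) and bigram LM log-probabilities.
    No reordering: the output order mirrors the Chinese input."""
    best = {'<s>': (0.0, [])}       # best[e] = (score, output ending in e)
    for f_word in zh:
        new_best = {}
        # unknown Chinese words pass through untranslated
        for p, e in cands.get(f_word, [(1e-9, f_word)]):
            for prev, (score, words) in best.items():
                s = score + math.log(p) + lm_logprob(prev, e)
                if e not in new_best or s > new_best[e][0]:
                    new_best[e] = (s, words + [e])
        best = new_best
    return ' '.join(max(best.values())[1])

# Example usage, reusing read_parallel and t from the earlier sketches:
english = [en[1:] for _, en in read_parallel('train.zh-en')]  # drop NULL
lm_logprob = make_bigram_lm(english)
cands = best_translations(t)
with open('test.zh', encoding='utf-8') as f:
    for line in f:
        print(decode(line.split(), cands, lm_logprob))
```

Note that this flat lattice has no $\varepsilon$-transitions, so the longest-to-shortest state ordering mentioned above doesn't come into play here; with the real $M_{\text{LM}}$ it does.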
Translate the test sentences (test.zh). In your report, show the generated translations. (1 point) Then compute the BLEU score of your translations against the reference translations using the command

python bleu.py your_translations test.en
Report your score, which is between 0 and 1 (higher is better). Using the settings above, you should be able to get at least 0.01. (3 points) To get the final point, try adjusting settings (reporting what you tried) to get the score up to 0.02. (1 point)
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: