In 2005, a blog post went viral showing a bootleg copy of Revenge of the Sith whose Chinese version had been translated (apparently by low-quality machine translation) back into English as a movie called Backstroke of the West. Can we do better? In this assignment, you'll build a model to learn word translations from bilingual Star Wars scripts -- the first step toward building a better translation system.
Clone the Homework 5 repository (note: repository updated 2018/11/27 10am). It contains the following files:
train.zh-en    training data
train.align    alignments for training data
test.zh        test data, input sentences
test.en        test data, reference translations
align-f1.py    evaluation script for alignment
bleu.py        evaluation script for translation
The training data is Star Wars Episodes 1, 2, and 4–6. The test data is Episode 3. The alignments in train.align
are not perfect; they are computed by a higher-quality alignment model (Model 4) on a much larger data set, but you'll treat them as a "silver" standard to evaluate your Model 1 alignments.
You may write code in any language you choose. You may reuse any code you've used in previous homework assignments, or even the solution or another student's code, as long as you cite it properly.
The file train.zh-en contains one sentence pair per line, with the Chinese and English versions separated by a tab (\t). Both sides have been tokenized. Below, we assume that a fake word $\text{NULL}$ is prepended to every English sentence, so that $e_0 = \text{NULL}$.

The file train.align contains the corresponding word alignments, one line per sentence pair, written as a space-separated list of position pairs. For example, the line

0-1 2-0

means that Chinese word 0 is aligned to English word 1, Chinese word 1 is unaligned (aligned to NULL), and Chinese word 2 is aligned to English word 0. (Note that in this format, positions are zero-based, unlike in the math notation elsewhere in this document.)
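For concreteness, here is a minimal sketch of reading both files into Python data structures. It assumes only the formats described above; the function names and choice of data structures are illustrative.

def read_parallel(path):
    """Return a list of (chinese_words, english_words) pairs (tokenized).
    Note: for Model 1 you would prepend the fake word NULL to the English side."""
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            zh, en = line.rstrip('\n').split('\t')
            pairs.append((zh.split(), en.split()))
    return pairs

def read_alignments(path):
    """Return a list of sets of (i, j) pairs, one set per sentence pair.
    i is a zero-based Chinese position, j a zero-based English position
    (position 0 is the first real English word, not NULL)."""
    alignments = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            links = set()
            for pair in line.split():
                i, j = pair.split('-')
                links.add((int(i), int(j)))
            alignments.append(links)
    return alignments

pairs = read_parallel('train.zh-en')
silver = read_alignments('train.align')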
Report your alignments for the first five lines of train.zh-en. Then, evaluate your alignments against train.align using the command

python align-f1.py your_alignments train.align

Report your F1 score. It should be at least 50%.
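For illustration only, here is a rough sketch of one possible pipeline for producing your_alignments: a few EM iterations to estimate the Model 1 translation probabilities t(f | e), then linking each Chinese word to its most probable English word, omitting the link when NULL wins. It is not a reference implementation; the iteration count and all names are arbitrary.

# Sketch of one way to produce Model 1 alignments in the format described above.
# It estimates t(f|e) with a few EM iterations and then links each Chinese word
# to its most probable English word; a link to NULL is simply left out.

from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t(f|e) by EM. Each pair is (chinese_words, english_words),
    where the English side already starts with the fake word NULL."""
    t = defaultdict(lambda: 1.0)              # effectively uniform at the start
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c(f, e)
        total = defaultdict(float)            # expected counts c(e)
        for zh, en in pairs:
            for f in zh:
                z = sum(t[(f, e)] for e in en)        # normalizer for this f
                for e in en:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t

def best_links(zh, en, t):
    """Alignment string for one pair; en[0] is NULL, so English indices shift by 1."""
    links = []
    for i, f in enumerate(zh):
        j = max(range(len(en)), key=lambda j: t[(f, en[j])])
        if j != 0:                            # j == 0 means aligned to NULL: omit
            links.append('%d-%d' % (i, j - 1))
    return ' '.join(links)

pairs = []
with open('train.zh-en', encoding='utf-8') as f:
    for line in f:
        zh, en = line.rstrip('\n').split('\t')
        pairs.append((zh.split(), ['NULL'] + en.split()))   # prepend fake NULL word

t = train_model1(pairs)
with open('your_alignments', 'w', encoding='utf-8') as out:
    for zh, en in pairs:
        print(best_links(zh, en, t), file=out)

The resulting your_alignments file can then be scored with align-f1.py as above.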
There doesn't seem to be enough time left this semester to require the next part, which is to try to build a translation system that translates Episode 3 better than Backstroke of the West. So this part is entirely optional, worth zero points. Consequently, I've also left it more open-ended than usual.
True Model 1 decoding should search through all possible reorderings of the input sentence. This is somewhat involved, so in this part we'll just try word-for-word translation, with no reordering.
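One simple way to realize this is sketched below, under the assumption that you already have a Model 1 table t(f | e) (for example, from the training sketch above): map each Chinese word to the English word that maximizes t(f | e), and pass unknown words through unchanged. Taking the argmax of t(f | e) is just one heuristic; the names and files are illustrative.

# Sketch: word-for-word translation of test.zh with no reordering.
# Assumes t is a dict mapping (chinese_word, english_word) to a probability,
# e.g. the t(f|e) table produced by the Model 1 training sketch above.

def build_lexicon(t):
    """Map each Chinese word to its single highest-scoring English word."""
    best = {}
    for (f, e), p in t.items():
        if e == 'NULL':                      # never translate a word as NULL
            continue
        if f not in best or p > best[f][1]:
            best[f] = (e, p)
    return {f: e for f, (e, _) in best.items()}

lexicon = build_lexicon(t)
with open('test.zh', encoding='utf-8') as fin, \
     open('your_translations', 'w', encoding='utf-8') as fout:
    for line in fin:
        # Unknown Chinese words are passed through unchanged.
        print(' '.join(lexicon.get(f, f) for f in line.split()), file=fout)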
Translate the test sentences (test.zh). They unfortunately aren't going to be very good. In your report, show the translations and comment on what problems you see and how they might be fixed. Then, evaluate your translations against the references using the command

python bleu.py your_translations test.en
Report your score, which is between 0 and 1 (higher is better). Did you do better than Backstroke of the West, which gets a BLEU of 0.0315?
One way to improve the translations would be to use a language model. The file lm.py contains code for training a Kneser-Ney smoothed language model: lm.kneserney(data, n) creates an n-gram language model (as an FST) from the data in data (a list of lists of words). The states are tuples of various lengths, and there are $\varepsilon$-transitions from longer tuples to shorter tuples. So, when you do the Viterbi algorithm, you'll need to sort the states from longest to shortest. (I apologize that this is somewhat obscure, but it makes the LM a lot more efficient.)
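A possible usage sketch is below. Only the call lm.kneserney(data, n) is taken from the description above; since the FST's interface isn't spelled out here, the states attribute in the last line is a hypothetical placeholder for however the model exposes its states.

# Sketch: train a trigram English language model from the training data.
import lm

data = []
with open('train.zh-en', encoding='utf-8') as f:
    for line in f:
        zh, en = line.rstrip('\n').split('\t')
        data.append(en.split())              # English side only, as a list of words

english_lm = lm.kneserney(data, 3)           # n-gram LM (here n = 3), as an FST

# For the Viterbi algorithm, process states from longest tuple to shortest,
# so that epsilon-transitions (longer tuple -> shorter tuple) are followed
# before the shorter states are used. 'states' below is a hypothetical
# attribute name, not part of the documented interface.
# ordered_states = sorted(english_lm.states, key=len, reverse=True)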
Another idea is to use the word alignments (yours, or the ones in train.align) to extract multi-word expressions and their translations. (In the early 2000's, this, combined with the availability of more data, led to dramatic improvements in translation quality.)

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: