When processing text in languages without standardized spelling rules, or historical texts that followed rules that are no longer considered standard, spelling normalization is a necessary first step. In this assignment, you will build a model that learns how to convert Shakespeare's original English spelling to modern English spelling.
Clone the Homework 2 repository. It contains the following files:
fst.py | Module for finite transducers |
cer.py | Module for evaluation |
train.old | Training data in original spelling |
train.new | Training data in modern spelling |
test.old | Test data in original spelling |
test.new | Test data in modern spelling |
The data is the text of Hamlet from Shakespeare's First Folio in original and modern English spelling. The training data is everything up to where Hamlet dies (spoiler alert) and the test data is the last 50 or so lines afterwards.
The fst module contains an FST class and associated functions that should be extremely helpful for this assignment. If you're writing in a language other than Python, please talk to the instructor about getting equivalent help in your programming language.
In the following, point values are written after each requirement, like this. [30]
train.new. [1] The fst module provides code for this: just use fst.make_ngram(open("train.new")).
FST class. You can use the method FST.visualize() to view your FST and make sure that it looks something like this (shown here for a two-letter alphabet):
Yours should be able to input any character found in train.new and output any character found in train.old.
FST.visualize() to check your results.
FST.normalize_cond() method to ensure that probabilities correctly sum to one. Briefly describe your initialization. [1]
fst.compose() to compose $M_{\text{LM}}$, $M_{\text{TM}}$, and $M_w$. [1]
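To make the conditional normalization mentioned above concrete, here is a minimal sketch in plain Python that does not use the fst module at all. The helper names init_typo_weights and normalize_conditional, the identity_weight value, and the dict-of-dicts layout are all assumptions made for illustration. It also covers substitutions only, whereas a real typo model would need insertions and deletions as well.

```python
def init_typo_weights(new_alphabet, old_alphabet, identity_weight=100.0):
    """Hypothetical initialization: unnormalized weights[new_char][old_char].

    Copying a character unchanged gets a large weight (an arbitrary choice);
    every other substitution gets weight 1.
    """
    weights = {}
    for new_char in sorted(new_alphabet):
        weights[new_char] = {
            old_char: (identity_weight if old_char == new_char else 1.0)
            for old_char in sorted(old_alphabet)
        }
    return weights


def normalize_conditional(weights):
    """Rescale each row so that, for a fixed new_char, the values sum to one."""
    probs = {}
    for new_char, row in weights.items():
        total = sum(row.values())
        probs[new_char] = {old_char: w / total for old_char, w in row.items()}
    return probs


# Two-letter alphabet, matching the size of the illustration in the text.
new_alphabet = set("ab")
old_alphabet = set("ab")
probs = normalize_conditional(init_typo_weights(new_alphabet, old_alphabet))
```

Each row of probs is a distribution over original-spelling characters given one modern character, which is the sense in which the probabilities "sum to one" here.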
test.old. For the first ten lines, report the best modernization together with its log-probability:
[Horatio] Now cracke a Noble heart. -108.6678162110888
Good ight sweet Prince, -85.19153221166528
and so on. [1]
cer module to evaluate how well your modernizer works. You can either call cer.cer() directly as a function, or run cer.py test.new youroutput.new where youroutput.new is a file containing your outputs. (Don't forget to remove the log-probabilities.) Report your score. [1] A lower score is better; if you initialize the model well, you should get a score under 10%. [1]
Now we'll improve our modernizer by training the model using hard EM. We'll train on parallel text rather than on nonparallel text as in class; it's faster this way and gives better results.
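As a rough illustration of the re-estimation half of hard EM, the sketch below counts how often each modern symbol is paired with each original symbol in a single best alignment and then renormalizes the counts. It is plain Python and makes several assumptions: the alignment is written by hand rather than found by a model, the empty string stands for epsilon (no modern character), and the function name reestimate and its additive constant add are hypothetical and not part of the fst module.

```python
from collections import Counter, defaultdict


def reestimate(aligned_pairs, old_alphabet, add=0.0):
    """Count (new_sym, old_sym) pairs and turn them into P(old_sym | new_sym).

    `add` is an additive smoothing constant applied to every old symbol in
    `old_alphabet`, which is assumed to contain every observed old symbol.
    """
    counts = defaultdict(Counter)
    for new_sym, old_sym in aligned_pairs:
        counts[new_sym][old_sym] += 1

    outputs = sorted(old_alphabet)
    probs = {}
    for new_sym, row in counts.items():
        total = sum(row.values()) + add * len(outputs)
        # A Counter returns 0 for unseen symbols, so unseen pairs get mass `add`.
        probs[new_sym] = {o: (row[o] + add) / total for o in outputs}
    return probs


# Hand-written alignment of modern "crack" with original "cracke";
# the final pair says the original spelling has an extra "e".
aligned_pairs = [("c", "c"), ("r", "r"), ("a", "a"), ("c", "c"), ("k", "k"), ("", "e")]
old_alphabet = set("cracke")
probs = reestimate(aligned_pairs, old_alphabet, add=1.0)
```

In a full training loop, the alignment would come from the best path under the current model, and the re-estimated probabilities would replace the model's weights before the next iteration.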
When fst.compose() creates a transition, it stores the transitions it simulates in the composed_from attribute.
FST.normalize_cond() takes an optional parameter add that lets you do this.
test.old and measure your score against test.new. Report your score. [1] It should eventually get better than 7.5%. [1]
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.