When processing text in languages without standardized spelling rules, or historical texts that followed rules that are no longer considered standard, spelling normalization is a necessary first step. In this assignment, you will build a model that learns how to convert Shakespeare's original English spelling to modern English spelling.
Clone the Homework 2 repository. It contains the following files:
| File | Description |
| --- | --- |
| fst.py | Module for finite transducers |
| cer.py | Module for evaluation |
| train.old | Training data in original spelling |
| train.new | Training data in modern spelling |
| test.old | Test data in original spelling |
| test.new | Test data in modern spelling |
The data is the text of Hamlet from Shakespeare's First Folio in original and modern English spelling. The training data is everything up to where Hamlet dies (spoiler alert) and the test data is the last 50 or so lines afterwards.
The fst module should be extremely helpful for this assignment. If you're writing in a language other than Python, please talk to the instructor about getting equivalent help in your programming language.
In the following, point values are written after each requirement, like this.[30]
train.new.[1] Feel free to reuse code from HW1, someone else's code, or the official solution's code, as long as you cite it. Smoothing shouldn't be necessary.
train.new and output any character found in train.old.
fst.string pretty much does this for you.)
fst.estimate_cond() to compute probabilities from them. Briefly describe your initialization.[1]
fst.compose() to compose $M_{\text{LM}}$, $M_{\text{TM}}$, and $M_w$.[1]
fst.topological_sort to give you a correct ordering. The notes (page 36) have been updated with a slight variant of the algorithm that is better suited to our data structures.
test.old. For the first ten lines, report the best modernization together with its log-probability:
[Horatio] Now cracke a Noble heart. -108.6678162110888
Good ight sweet Prince, -85.19153221166528
and so on.[1]
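The decoding step above depends on visiting states in topological order so that every predecessor of a state is final before the state itself is scored. The following is a minimal standalone sketch of that idea. It does not use the fst module: the edge-list representation `(src, dst, label, logprob)` and the function names `topological_order` and `viterbi` are assumptions made for illustration, and the graph is assumed to be acyclic with a single accepting state.

```python
import math
from collections import defaultdict, deque


def topological_order(states, edges):
    """Return the states ordered so every edge goes from earlier to later (Kahn's algorithm)."""
    indegree = {q: 0 for q in states}
    outgoing = defaultdict(list)
    for src, dst, _label, _logprob in edges:
        indegree[dst] += 1
        outgoing[src].append(dst)
    queue = deque(q for q in states if indegree[q] == 0)
    order = []
    while queue:
        q = queue.popleft()
        order.append(q)
        for r in outgoing[q]:
            indegree[r] -= 1
            if indegree[r] == 0:
                queue.append(r)
    if len(order) != len(states):
        raise ValueError("graph has a cycle; no topological order exists")
    return order


def viterbi(states, edges, start, accept):
    """Return (best output string, its log-probability) from start to accept."""
    incoming = defaultdict(list)
    for edge in edges:
        incoming[edge[1]].append(edge)
    best = {q: -math.inf for q in states}  # best log-probability of reaching q
    back = {}                              # back[q] = (previous state, label used)
    best[start] = 0.0
    for q in topological_order(states, edges):
        for src, _dst, label, logprob in incoming[q]:
            score = best[src] + logprob
            if score > best[q]:
                best[q] = score
                back[q] = (src, label)
    if best[accept] == -math.inf:
        raise ValueError("accept state is unreachable")
    # Follow back pointers from the accept state to recover the labels.
    labels = []
    q = accept
    while q != start:
        q, label = back[q]
        labels.append(label)
    labels.reverse()
    return "".join(labels), best[accept]


# Toy example (hypothetical data): the path a, c has probability 0.25, beating d at 0.1.
toy_states = [0, 1, 2]
toy_edges = [
    (0, 1, "a", math.log(0.5)),
    (0, 1, "b", math.log(0.3)),
    (1, 2, "c", math.log(0.5)),
    (0, 2, "d", math.log(0.1)),
]
best_string, best_logprob = viterbi(toy_states, toy_edges, 0, 2)  # best_string == "ac"
```

Working in log-probabilities, as in the sample output above, avoids underflow on long lines, since the scores are summed rather than multiplied.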
cer module to evaluate how well your modernizer works. You can either call cer.cer() directly as a function, or run cer.py test.new youroutput.new where youroutput.new is a file containing your outputs. (Don't forget to remove the log-probabilities.) Report your score.[1] A lower score is better; if you initialize the model well, you should get a score under 10%.[1]
This part is optional for CDT 40310 students, who automatically get full credit.
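As background for the evaluation step above, character error rate is commonly defined as the total character-level edit distance between outputs and references, divided by the total number of reference characters. The sketch below illustrates that definition; it is not the code in cer.py, whose exact computation may differ. The names `edit_distance`, `char_error_rate`, and `strip_logprob` are hypothetical, and `strip_logprob` assumes every output line ends with a space followed by the log-probability, as in the sample output earlier.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(refs, hyps):
    """Total edit distance divided by total reference length (lower is better)."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_errors / total_chars


def strip_logprob(line):
    """Drop the final space-separated field, assumed to be the log-probability."""
    text, _sep, _logprob = line.rstrip("\n").rpartition(" ")
    return text
```

For example, scoring the hypothesis "ight" against the reference "night" gives one deletion over five reference characters, so `char_error_rate(["night"], ["ight"])` is 0.2.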
Now, we'll improve our modernizer by training the model using hard EM. We'll train on parallel text rather than on nonparallel text as in class; it's faster this way and gives better results.
t of $M_{\text{TM}}$, how many times does the best path use a transition that is made from t? Note that fst.compose keeps track of which transitions are made from which: when you do m = fst.compose(m1, m2), for each transition t created, m.composed_from[t] is the pair of transitions that t was made from.
FST.normalize_cond()
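The counting described above, where each transition on the best path is traced back to the $M_{\text{TM}}$ transition it was made from, can be sketched with plain dictionaries standing in for composed_from. This is an illustration only, under two assumptions that are not stated in the assignment: that the composition was done in two steps as ($M_{\text{LM}}$ with $M_{\text{TM}}$) and then with $M_w$, so that the $M_{\text{TM}}$ transition is the second element of the inner pair, and that transitions are hashable. The function name `count_tm_uses` and the string stand-ins for transitions are hypothetical.

```python
from collections import Counter


def count_tm_uses(best_path, outer_from, inner_from):
    """Count how often each TM transition is used on the best path.

    best_path:  transitions of the full composition along the best path
    outer_from: maps each of those to (inner transition, M_w transition)
    inner_from: maps each inner transition to (LM transition, TM transition)
    """
    counts = Counter()
    for t in best_path:
        t_inner, _t_w = outer_from[t]
        _t_lm, t_tm = inner_from[t_inner]
        counts[t_tm] += 1
    return counts


# Toy stand-ins (hypothetical): three path transitions, two of which come from
# the same TM transition "tm_a:a".
inner_from = {
    "i1": ("lm_1", "tm_a:a"),
    "i2": ("lm_2", "tm_b:b"),
    "i3": ("lm_3", "tm_a:a"),
}
outer_from = {
    "o1": ("i1", "w_1"),
    "o2": ("i2", "w_2"),
    "o3": ("i3", "w_3"),
}
counts = count_tm_uses(["o1", "o2", "o3"], outer_from, inner_from)
# counts == Counter({"tm_a:a": 2, "tm_b:b": 1})
```

Summing these counters over all training lines gives the hard-EM counts; if the machines were composed in a different order, the index used to pick out the $M_{\text{TM}}$ transition changes accordingly.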
fst.estimate_cond takes an optional parameter add that lets you do this.
test.old and measure your score against test.new. Report your score.[1] It should eventually get better than 7.5%.[1]
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.