When processing text in languages without standardized spelling rules, or historical texts that followed rules that are no longer considered standard, spelling normalization is a necessary first step. In this assignment, you will build a model that learns how to convert Shakespeare's original English spelling to modern English spelling.
Clone the Homework 2 repository. It contains the following files:
| File | Description |
| --- | --- |
| fst.py | Module for finite transducers |
| cer.py | Module for evaluation |
| train.old | Training data in original spelling |
| train.new | Training data in modern spelling |
| test.old | Test data in original spelling |
| test.new | Test data in modern spelling |
The data is the text of Hamlet from Shakespeare's First Folio in original and modern English spelling. The training data is everything up to where Hamlet dies (spoiler alert) and the test data is the last 50 or so lines afterwards.
The fst module should be extremely helpful for this assignment. If you're writing in a language other than Python, please talk to the instructor about getting equivalent help in your programming language.
In the following, point values are written after each requirement, like this. [30]
train.new. [1] Feel free to reuse code from HW1, someone else's code, or the official solution's code, as long as you cite it. Smoothing shouldn't be necessary.
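For orientation, here is a minimal plain-Python sketch of unsmoothed bigram estimation over characters. It does not use the fst module, and the names `train_bigram`, `logprob`, `BOS`, and `EOS` are made up for this illustration; in your solution the same probabilities would presumably end up as weights on a transducer.

```python
import math
from collections import Counter, defaultdict

# Hypothetical boundary symbols; any markers that cannot occur in the data would do.
BOS, EOS = "<s>", "</s>"


def train_bigram(lines):
    """Estimate unsmoothed P(cur | prev) over characters from an iterable of strings."""
    counts = defaultdict(Counter)
    for line in lines:
        symbols = [BOS] + list(line) + [EOS]
        for prev, cur in zip(symbols, symbols[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {cur: n / total for cur, n in nexts.items()}
    return probs


def logprob(probs, line):
    """Log-probability of a string; -inf if it contains an unseen bigram."""
    symbols = [BOS] + list(line) + [EOS]
    total = 0.0
    for prev, cur in zip(symbols, symbols[1:]):
        p = probs.get(prev, {}).get(cur, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


# Toy data standing in for the lines of train.new.
model = train_bigram(["to be", "to me"])
score = logprob(model, "to be")
```

On real data you would pass in the lines of train.new with the trailing newline stripped. Because nothing is smoothed, any bigram that never occurs in training gets probability zero.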
Yours should be able to input any character found in train.new and output any character found in train.old.
fst.string pretty much does this for you.)
fst.estimate_cond() to compute probabilities from them. Briefly describe your initialization. [1]
fst.compose() to compose $M_{\text{LM}}$, $M_{\text{TM}}$, and $M_w$. [1]
fst.topological_sort to give you a correct ordering. The notes (page 36) have been updated with a slight variant of the algorithm that is better suited to our data structures.
test.old. For the first ten lines, report the best modernization together with its log-probability:
[Horatio] Now cracke a Noble heart. -108.6678162110888
Good ight sweet Prince, -85.19153221166528
and so on. [1]
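If it helps to see the shape of the best-path search, here is a minimal plain-Python sketch of a Viterbi-style pass over an acyclic graph, visiting states in topological order and keeping back-pointers. It does not use the fst module: the edge tuples `(source, target, label, log_weight)` and the helpers `topological_order` and `viterbi` are assumptions made for this illustration, and in your solution fst.topological_sort would supply the ordering instead.

```python
import math
from collections import defaultdict, deque


def topological_order(edges):
    """Kahn's algorithm over the states mentioned in `edges`."""
    indegree = defaultdict(int)
    outgoing = defaultdict(list)
    states = set()
    for src, dst, _label, _weight in edges:
        indegree[dst] += 1
        outgoing[src].append(dst)
        states.update((src, dst))
    queue = deque(sorted(s for s in states if indegree[s] == 0))
    order = []
    while queue:
        state = queue.popleft()
        order.append(state)
        for dst in outgoing[state]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                queue.append(dst)
    if len(order) != len(states):
        raise ValueError("graph has a cycle")
    return order


def viterbi(edges, start, final):
    """Return (labels of the best path, its log-probability) from start to final."""
    incoming = defaultdict(list)
    for src, dst, label, weight in edges:
        incoming[dst].append((src, label, weight))
    best = {start: 0.0}   # best log-probability of reaching each state
    back = {}             # back-pointer: state -> (previous state, edge label)
    for state in topological_order(edges):
        for src, label, weight in incoming[state]:
            if src in best and best[src] + weight > best.get(state, -math.inf):
                best[state] = best[src] + weight
                back[state] = (src, label)
    if final not in best:
        raise ValueError("final state is unreachable")
    labels = []
    state = final
    while state != start:
        state, label = back[state]
        labels.append(label)
    return "".join(reversed(labels)), best[final]


# Toy graph: two ways to reach state 1, plus a direct edge from 0 to 2.
toy_edges = [
    (0, 1, "a", math.log(0.6)),
    (0, 1, "b", math.log(0.4)),
    (1, 2, "c", math.log(0.5)),
    (0, 2, "d", math.log(0.2)),
]
best_string, best_logprob = viterbi(toy_edges, start=0, final=2)
```

Working in log space turns products of probabilities into sums, which is why the reported scores are negative numbers like the ones above.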
cer module to evaluate how well your modernizer works. You can either call cer.cer() directly as a function, or run cer.py test.new youroutput.new where youroutput.new is a file containing your outputs. (Don't forget to remove the log-probabilities.) Report your score. [1] A lower score is better; if you initialize the model well, you should get a score under 10%. [1]
This part is optional for CDT 40310 students, who automatically get full credit.
Now, we'll improve our modernizer by training the model using hard EM. We'll train on parallel text rather than on nonparallel text as in class; it's faster this way and gives better results.
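As a rough illustration of the re-estimation half of hard EM, the sketch below takes character pairs read off the best paths, counts them, optionally adds a constant to every count, and renormalizes per modern character. It is a plain-Python analogue, not the fst module's implementation: `reestimate` and its arguments are hypothetical names, and it covers substitutions only (no insertions or deletions).

```python
from collections import Counter


def reestimate(aligned_pairs, modern_alphabet, old_alphabet, add=0.0):
    """Re-estimate P(old | modern) from hard counts.

    aligned_pairs: (modern_char, old_char) pairs taken from the best paths.
    add: constant added to every count before normalizing (add-delta smoothing).
    """
    counts = Counter(aligned_pairs)
    probs = {}
    for m in modern_alphabet:
        total = sum(counts[(m, o)] + add for o in old_alphabet)
        if total == 0:
            continue  # modern character never seen and no smoothing: leave it out
        probs[m] = {o: (counts[(m, o)] + add) / total for o in old_alphabet}
    return probs


# Toy hard alignments: modern "u" was written as old "v" once and as "u" twice.
pairs = [("u", "v"), ("u", "u"), ("u", "u"), ("e", "e")]
unsmoothed = reestimate(pairs, "ue", "uve")
smoothed = reestimate(pairs, "ue", "uve", add=1.0)
```

Without smoothing, any pair that the best paths never use drops to probability zero and can never come back in later iterations; a small added constant keeps those options alive.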
t of $M_{\text{TM}}$, how many times does the best path use a transition that is made from t? Note that fst.compose keeps track of which transitions are made from which: when you do m = fst.compose(m1, m2), for each transition t created, m.composed_from[t] is the pair of transitions that t was made from.
FST.normalize_cond()
fst.estimate_cond takes an optional parameter add that lets you do this.
test.old and measure your score against test.new. Report your score. [1] It should eventually get better than 7.5%. [1]
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.