In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Clone the Homework 1 repository. It contains the following files:
- data/train: training data
- data/minitrain: smaller training data
- data/dev: development data
- data/test: test data
- predict.py: text-prediction demo
- unigram.py: Unigram language model
The data files come from the NUS SMS Corpus, a collection of real text messages sent mostly by students at the National University of Singapore. This is the English portion of the corpus, though it has a lot of interesting examples of Singlish, which mixes in elements of Malay and various Chinese languages.
In the following, point values are written in parentheses after each requirement, like this (30 points).
All the language models that you build for this assignment should support the following operations. If the model object is called m, then m should have the following methods:

- m.start(): Return the start state.
- m.read(q, a): Return the state that the model would be in after it was in state q and read symbol a.
- m.logprob(q, a): Return the model's log-probability (base e) of reading symbol a in state q.
- m.best(q): Return the symbol with the highest model probability in state q.

These operations are used by predict.py, which predicts the next 20 characters based on what you've typed so far.
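As an illustration, here is a minimal sketch of a model that satisfies this interface (a hypothetical stand-in for the provided unigram.py, not a copy of it; it assumes each training line is a list of characters):

```python
import collections
import math

class UnigramSketch:
    """Illustrative unigram model implementing the m.start/read/logprob/best
    interface described above. A sketch only, not the provided unigram.py."""

    def __init__(self, data):
        # data: list of lists of symbols (e.g., characters)
        counts = collections.Counter(a for line in data for a in line)
        total = sum(counts.values())
        self.logprobs = {a: math.log(c / total) for a, c in counts.items()}

    def start(self):
        # A unigram model ignores context, so any constant state works.
        return None

    def read(self, q, a):
        # The state never changes.
        return q

    def logprob(self, q, a):
        return self.logprobs.get(a, float('-inf'))

    def best(self, q):
        # Most frequent symbol overall.
        return max(self.logprobs, key=self.logprobs.get)
```

A unigram model trained on English text will almost always return the space character from best, which is why the demo below is so boring.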
The file unigram.py provides a class Unigram that implements the above interface using a unigram language model. The Unigram constructor expects a list of lists of words to train on.

- Write a program that reads in the training data (data/train; be sure to strip off trailing newlines) and uses unigram.Unigram to train a unigram model. (3 points)
- For each position in the development data (data/dev), predict the most probable character given all previous correct characters. (5 points) Report the accuracy, that is, the percentage of predictions that are correct (1 point); it should be about 16.5%. (1 point)
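The accuracy computation can be sketched as a single loop over the interface methods (the function name accuracy is my own, and lines are assumed to be lists of characters):

```python
def accuracy(m, data):
    """Predict each character given the true previous characters, then
    return the fraction of predictions that were correct."""
    correct = total = 0
    for line in data:
        q = m.start()
        for a in line:
            if m.best(q) == a:
                correct += 1
            total += 1
            # Feed the *correct* character, not the predicted one.
            q = m.read(q, a)
    return correct / total
```

Note that the model always reads the true character after predicting, so one wrong prediction doesn't derail the rest of the line.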
Try the text-prediction demo:

    python predict.py data/train

By default, it uses a unigram language model, which is not very interesting, because the model always predicts a space. (Nothing to report here.)
In this part, you'll replace the unigram language model with a 5-gram model.

- Implement a 5-gram language model and train it on data/train. Report your accuracy on data/dev, which should be at least 49%. (1 point)
- Run your model on the test data (data/test) and report your accuracy, which should be at least 49%. (1 point)
- Try out predict.py with your model. (Nothing to report.)
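One way to fit an n-gram model behind the same interface is to use the last n-1 characters as the state. Below is a hedged sketch with add-one smoothing; the assignment doesn't prescribe a smoothing method, so this is just one simple choice, and all names are illustrative:

```python
import collections
import math

class NgramSketch:
    """Illustrative n-gram character model with add-one smoothing,
    implementing the start/read/logprob/best interface. A sketch under
    assumed details (smoothing and padding are left to you)."""

    def __init__(self, n, data):
        self.n = n
        self.vocab = set(a for line in data for a in line)
        self.counts = collections.Counter()          # (context, symbol) counts
        self.context_counts = collections.Counter()  # context counts
        for line in data:
            q = self.start()
            for a in line:
                self.counts[(q, a)] += 1
                self.context_counts[q] += 1
                q = self.read(q, a)

    def start(self):
        # The state is the last n-1 symbols, padded with a BOS marker.
        return ('<BOS>',) * (self.n - 1)

    def read(self, q, a):
        # Shift the context window left by one.
        return (q + (a,))[1:]

    def logprob(self, q, a):
        # Add-one smoothing over the character vocabulary.
        num = self.counts[(q, a)] + 1
        den = self.context_counts[q] + len(self.vocab)
        return math.log(num / den)

    def best(self, q):
        return max(self.vocab, key=lambda a: self.logprob(q, a))
```

For a 5-gram model you would use n=5; add-one smoothing is crude, and you may get better accuracy with interpolation or backoff to lower-order models.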
Now we will try building a neural language model using PyTorch. (You can use a different framework if you want, but PyTorch is recommended.)
- Train your model on data/minitrain first, which takes less than a minute per epoch. Report your train perplexity and dev accuracy. (1 point)
- Then train on data/train and validate on data/dev. Warning: be sure to allow enough time for this, and it might be good to save the model after every epoch. For us, each epoch took about 30 minutes, and 10 epochs was enough. Report your final dev accuracy (1 point), which should be at least 50%. (1 point)
- Run your model on the test data (data/test) and report your accuracy (1 point), which should be at least 51%. (1 point)
- Try out predict.py with your model. Because training takes a while, you'll probably want to load a trained model from disk. (Nothing to report.)
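Perplexity can be computed from the same interface as the exponential of the negative mean log-probability per character. A small sketch (the function name is my own):

```python
import math

def perplexity(m, data):
    """Perplexity = exp of the negative mean log-probability (base e)
    per character, computed with the model interface above."""
    total_logprob = 0.0
    total = 0
    for line in data:
        q = m.start()
        for a in line:
            total_logprob += m.logprob(q, a)
            total += 1
            q = m.read(q, a)
    return math.exp(-total_logprob / total)
```

Lower perplexity means the model assigns higher probability to the held-out text; a model that always assigns probability 1/k to each character has perplexity exactly k.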
Please read these submission instructions carefully. This is our first year using this workflow, so apologies in advance for any snags.
Tag your final commit for each part with git tag -a part1, git tag -a part2, etc. If you make the final submission late, we'll use these tags to compute the per-part late penalty. (You can also create the tags after the fact, with git tag -a part1 abc123, where abc123 is the commit's checksum.) Push your commits and tags (git push --tags origin HEAD).