In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. It contains the following files:
| `data/train` | training data |
| `data/dev` | development data |
| `data/test` | test data |
| `predict.py` | text-prediction demo |
| `unigram.py` | unigram language model |
The data files come from the NUS SMS Corpus, a collection of real text messages sent mostly by students at the National University of Singapore. This is the English portion of the corpus, though it has a lot of interesting examples of Singlish, which mixes in elements of Malay and various Chinese languages.
In the following, point values are written after each requirement, like this.^30
All the language models that you build for this assignment should support the following operations. If the model object is called `m`, then `m` should have the following methods:

- `m.start()`: Return the start state.
- `m.input(q, a)`: Return the state that the model would be in after it was in state `q` and read input symbol `a`.
- `m.best(q)`: Return the symbol with the highest model probability in state `q`.

These operations are used by `predict.py`, which predicts the next 20 characters based on what you've typed so far.
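As a minimal sketch of this interface, here is a hypothetical uniform model (the class name and tie-breaking rule are illustrative assumptions, not part of the assignment):

```python
class UniformModel:
    """Hypothetical model implementing the required interface:
    start(), input(q, a), and best(q)."""

    def __init__(self, data):
        # data: a list of lists of characters
        self.vocab = sorted({a for line in data for a in line})

    def start(self):
        # A uniform model is stateless, so any constant works as the state.
        return None

    def input(self, q, a):
        # Return the state after being in state q and reading symbol a.
        return q

    def best(self, q):
        # Return the symbol with the highest probability in state q.
        # Here all symbols are equally likely, so break ties alphabetically.
        return self.vocab[0]

m = UniformModel([list("hello"), list("world")])
q = m.start()
q = m.input(q, 'h')
print(m.best(q))  # → 'd' (alphabetically first character seen in training)
```

The state-machine framing (`start`/`input`/`best`) is what lets `predict.py` feed in your keystrokes one at a time and query the next character after each one.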
The file `unigram.py` provides a class `Unigram` that implements the above interface using a unigram language model. The `Unigram` constructor expects a list of lists of characters to train on.

- Write a program that reads in the training data (`data/train`; be sure to strip off trailing newlines) and uses `unigram.Unigram` to train a unigram model.^3
- For each character position in the development data (`data/dev`), predict the most probable character given all previous correct characters.^5
- Report the accuracy (what percent of the predictions are correct),^1 which should be about 16.5%.^1
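The evaluation described above can be sketched as follows. The `Unigram` class below is a stand-in for the provided `unigram.Unigram` (the real one is in your repository; only the interface is assumed here), and `accuracy` shows the prediction loop:

```python
from collections import Counter

class Unigram:
    """Stand-in for the provided unigram.Unigram, assuming the same
    constructor (a list of lists of characters) and interface."""

    def __init__(self, data):
        counts = Counter(a for line in data for a in line)
        self.guess = counts.most_common(1)[0][0]

    def start(self):
        return None

    def input(self, q, a):
        return q

    def best(self, q):
        return self.guess

def accuracy(m, data):
    """Fraction of positions where m.best predicts the next character,
    given all previous *correct* characters."""
    correct = total = 0
    for line in data:
        q = m.start()
        for a in line:
            correct += (m.best(q) == a)
            total += 1
            q = m.input(q, a)  # feed the true character, not the guess
    return correct / total

# Usage with the real files (note the newline stripping):
# train = [list(line.rstrip('\n')) for line in open('data/train')]
# dev   = [list(line.rstrip('\n')) for line in open('data/dev')]
# print(accuracy(unigram.Unigram(train), dev))
```

Note that after each prediction the model is advanced with the *correct* character, so one wrong guess doesn't derail the rest of the line.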
Try running `python predict.py data/train`. By default, it uses a unigram language model, which is not very interesting, because the model always predicts a space. (Nothing to report here.)
In this part, you'll replace the unigram language model with a 5-gram model.
- Implement a 5-gram language model with the same interface and train it on `data/train`. Report your accuracy on `data/dev`, which should be at least 37%.^1
- Run your model on the test data (`data/test`) and report your accuracy, which should be at least 42%.^1
- Try out `predict.py` with your model. (Nothing to report.)
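One possible shape for an n-gram model under this interface is sketched below: the state is the last n−1 characters, and unseen contexts fall back to overall character counts. The class name, padding symbol, and fallback strategy are all assumptions, not requirements:

```python
from collections import Counter, defaultdict

class Ngram:
    """Sketch of an n-gram character model (n=5 for this assignment)
    implementing start(), input(q, a), and best(q)."""

    def __init__(self, n, data):
        self.n = n
        self.counts = defaultdict(Counter)
        for line in data:
            # Pad each line with beginning-of-sequence symbols.
            chars = ['<BOS>'] * (n - 1) + list(line)
            for i in range(n - 1, len(chars)):
                context = tuple(chars[i - n + 1:i])
                self.counts[context][chars[i]] += 1
        # Fallback for contexts never seen in training.
        self.fallback = Counter(a for line in data for a in line)

    def start(self):
        return ('<BOS>',) * (self.n - 1)

    def input(self, q, a):
        # The state is the most recent n-1 symbols.
        return (q + (a,))[1:]

    def best(self, q):
        dist = self.counts.get(q)
        if not dist:
            dist = self.fallback
        return dist.most_common(1)[0][0]
```

Because the state is just a tuple of recent characters, `input` is O(1) and the model plugs directly into the same evaluation loop used for the unigram model.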
Now we will try building a neural language model using PyTorch. (You can use a different framework if you want, but PyTorch is recommended.)
- Train your model on `data/train` and validate on `data/dev`. Save the model after every epoch and submit your best model.^1 For us, each epoch took about 15 minutes, and 12 or 13 epochs were usually enough to reach a dev accuracy over 40%.
- Run your model on the test data (`data/test`) and report your accuracy,^1 which should be at least 45%.^1
- Try out `predict.py` with your model. Because training takes a while, you'll probably want to load a trained model from disk. (Nothing to report.)
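One way the same interface might wrap a recurrent network in PyTorch is sketched below; the class name, the `LSTMCell` choice, and the sizes are all assumptions, and the training loop is omitted (a trained model would be saved with `torch.save` and restored with `torch.load` or `load_state_dict`):

```python
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    """Sketch of an RNN character model exposing start/input/best.
    The state q is the LSTM's (hidden, cell) pair."""

    def __init__(self, vocab, hidden_size=128):
        super().__init__()
        self.vocab = sorted(vocab)
        self.index = {a: i for i, a in enumerate(self.vocab)}
        self.emb = nn.Embedding(len(self.vocab), hidden_size)
        self.rnn = nn.LSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, len(self.vocab))

    def start(self):
        h = torch.zeros(1, self.rnn.hidden_size)
        return (h, torch.zeros_like(h))

    def input(self, q, a):
        # Embed one character and advance the recurrent state.
        x = self.emb(torch.tensor([self.index[a]]))
        return self.rnn(x, q)

    def best(self, q):
        h, _ = q
        logits = self.out(h)
        return self.vocab[logits.argmax().item()]
```

Keeping the recurrent state inside `q` (rather than inside the module) is what makes the model fit the assignment's state-machine interface: `predict.py` can hold several candidate states at once without the model tracking any of them.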
Please read these submission instructions carefully.