In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Clone the Homework 1 repository. It contains the following files:
data/train      training data
data/dev        development data
data/test       test data
keyboard.py     GUI demo
fst.py          FST library
unigram.py      Unigram language model
rnn/model.npy   RNN parameters
rnn/vocab       vocabulary for RNN
In the following, point values are written after each requirement, like this.30
All the language models that you build for this assignment should support the following operations. If the model object is called m, then m should have the following methods:

m.get_start(): Return the start state.
m.get_transitions(q, c): Return a list of transitions (instances of fst.Transition) that are possible from state q when the next input symbol is c.
m.get_prob(t): Return the probability of transition t.

You can try out your language models using keyboard.py, which shows a keyboard whose keys grow and shrink depending on their current probability.
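For concreteness, here is a minimal sketch of a class that follows this interface. It is not one of the assignment's models, and the fst.Transition constructor arguments shown (from-state, symbol, to-state) are an assumption, not something specified by fst.py.

```python
import fst

class UniformModel:
    """Toy model that gives every symbol the same probability.
    Included only to illustrate the shape of the interface."""

    def __init__(self, vocab):
        self.vocab = list(vocab)

    def get_start(self):
        # A single state suffices for a memoryless model.
        return 'start'

    def get_transitions(self, q, c):
        # Assumed constructor: fst.Transition(from_state, symbol, to_state).
        return [fst.Transition(q, c, 'start')]

    def get_prob(self, t):
        return 1.0 / len(self.vocab)
```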
The file unigram.py provides a class Unigram that extends fst.FST. It inherits the first two methods of the above interface from fst.FST and adds the third method.
Write a program that uses unigram.Unigram to train a unigram model on the training data (data/train).1 Then, for each character position in the development data (data/dev), it should predict the most probable character given all previous correct characters.7 Report the accuracy (what percent of the predictions are correct) on the development data.1 It should be about 15%.1
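Here is a rough sketch of the evaluation loop, assuming a trained model m with the interface above. The transition attribute t.r used to advance the state is a hypothetical name; adapt it to whatever fst.Transition actually provides.

```python
def best_next_char(m, q, vocab):
    """Return the most probable next character from state q."""
    best_c, best_p = None, -1.0
    for c in vocab:
        for t in m.get_transitions(q, c):
            p = m.get_prob(t)
            if p > best_p:
                best_c, best_p = c, p
    return best_c

def accuracy(m, lines, vocab):
    correct = total = 0
    for line in lines:
        q = m.get_start()
        for gold in line:
            if best_next_char(m, q, vocab) == gold:
                correct += 1
            total += 1
            # Advance on the *correct* character, not the prediction.
            q = m.get_transitions(q, gold)[0].r   # hypothetical attribute name
    return correct / total
```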
In this part, you'll try to improve the quality of your character predictor. Override get_transitions to create transitions as needed. Similarly, override get_prob to assign probabilities to the created transitions; again, proper smoothing is not required, but the probabilities should probably be something other than 0 or NaN!
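As one possible (not required) approach, here is a sketch of a bigram character model whose state is the previous character, with transitions created on demand. The fst.Transition constructor signature and the t.q / t.a attribute names are assumptions.

```python
import fst
from collections import defaultdict

class Bigram:
    """Sketch of a bigram model with lazily-created transitions."""

    def __init__(self, lines, vocab):
        self.vocab = set(vocab)
        self.counts = defaultdict(lambda: defaultdict(int))
        for line in lines:
            prev = '<BOS>'
            for c in line:
                self.counts[prev][c] += 1
                prev = c

    def get_start(self):
        return '<BOS>'

    def get_transitions(self, q, c):
        # Build the one transition we need, instead of storing them all.
        return [fst.Transition(q, c, c)]   # assumed constructor signature

    def get_prob(self, t):
        prev, c = t.q, t.a                 # hypothetical attribute names
        count = self.counts[prev][c]
        total = sum(self.counts[prev].values())
        # Add-one smoothing keeps the probability away from 0 and NaN.
        return (count + 1) / (total + len(self.vocab))
```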
This part is optional for CDT 40310 students, who automatically get full credit. For this part, you'll need to have NumPy installed. (If you're using a language other than Python, talk to the instructor about converting rnn/model.npy into a format you can use.)
Now we will try using a neural language model. Training neural models can be laborious both for you and the computer (especially if you don't have access to a GPU), so we'll just use a pre-trained model. You are welcome to train your own if you want to.
The vocabulary is in rnn/vocab, which just consists of one symbol per line. (Note that space is one of the symbols, so strip off newlines but not spaces.) The first line is symbol 0, the second line is symbol 1, and so on. Write code to read in the development/test data and replace all unknown symbols with the number for <unk>.1
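A minimal sketch of the vocabulary handling (helper names here are illustrative, not required):

```python
# Read the vocabulary: line i (0-indexed) is symbol number i.
# Strip the trailing newline but keep spaces.
with open('rnn/vocab') as f:
    vocab = [line.rstrip('\n') for line in f]
sym2num = {sym: i for i, sym in enumerate(vocab)}
unk = sym2num['<unk>']

def numberize(line):
    """Map each character of a dev/test line to its number, using <unk> for unknowns."""
    return [sym2num.get(c, unk) for c in line]
```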
Load the parameters in rnn/model.npy using numpy.load. You will read five objects, corresponding to $\mathbf{A}^\top$, $\mathbf{B}^\top$, $\mathbf{c}$, $\mathbf{D}^\top$, and $\mathbf{e}$ on page 28 of the notes.1 (To get $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{D}$ as in the notes, you'll need to transpose them; sorry for this inconvenience.)
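One way a single file can hold several arrays is consecutive numpy.save calls, which can be read back with repeated numpy.load calls on an open file object; that packing is an assumption here, so adjust if the file is organized differently.

```python
import numpy as np

with open('rnn/model.npy', 'rb') as f:
    A_T = np.load(f)   # transpose of A
    B_T = np.load(f)   # transpose of B
    c   = np.load(f)
    D_T = np.load(f)   # transpose of D
    e   = np.load(f)

A, B, D = A_T.T, B_T.T, D_T.T   # recover A, B, D as in the notes
```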
Implement the same interface as before, so that get_start() computes $\mathbf{h}^{(1)}$, get_transitions($\mathbf{h}^{(i-1)}$, a) creates a transition to state $\mathbf{h}^{(i)}$, and get_prob returns its probability $[\mathbf{y}^{(i-1)}]_a$. [It may be convenient to store transition t's probability as t.prob.] Report the accuracy on the test data,1 which should be about 50%.1
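The exact update and output equations are the ones on page 28 of the notes, which aren't reproduced here. The helpers below assume a common Elman-style form, $\mathbf{h}^{(i)} = \tanh(\mathbf{A}\mathbf{h}^{(i-1)} + \mathbf{B}\mathbf{x}^{(i)} + \mathbf{c})$ and $\mathbf{y}^{(i)} = \mathrm{softmax}(\mathbf{D}\mathbf{h}^{(i)} + \mathbf{e})$ with $\mathbf{x}^{(i)}$ a one-hot input vector, so treat them as a placeholder to check against the notes. They would sit behind get_transitions and get_prob, with each transition's probability cached as t.prob as suggested.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def next_state(h, a, A, B, c):
    """Assumed recurrence: advance the hidden state after reading symbol number a."""
    x = np.zeros(B.shape[1])
    x[a] = 1.0                   # one-hot encoding of the input symbol
    return np.tanh(A @ h + B @ x + c)

def output_dist(h, D, e):
    """Assumed output layer: distribution over the next symbol given state h."""
    return softmax(D @ h + e)
```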
Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file netid-hw1-part.tgz, where part is the part (1, 2, or 3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!
Your submission should contain: