CSE 40657/60657
Homework 1

2019/09/20 at 5pm

In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.


Clone the Homework 1 repository. It contains the following files:

data/traintraining data
data/devdevelopment data
data/testtest data
keyboard.pyGUI demo
fst.pyFST library
unigram.pyUnigram language model
rnn/model.npyRNN parameters
rnn/vocabvocabulary for RNN

In the following, point values are written after each requirement, like this.30

1. Baseline

All the language models that you build for this assignment should support the following operations. If the model object is called m, then m should have the following methods:

For fun, you can try plugging your language models into, which shows a keyboard whose keys grow and shrink depending on their current probability.

The file provides a class Unigram that extends fst.FST. It inherits the first two methods of the above interface from fst.FST and adds the third method.

Write a program that uses unigram.Unigram to train a unigram model on the training data (data/train).1 Then, for each character position in the development data (data/dev), it should predict the most probable character given all previous correct characters.7 Report the accuracy (what percent of the predictions are correct) on the development data.1 It should be about 15%.1

2. m-gram language model

In this part, you'll try to improve the quality of your character predictor.

  1. Implement an $m$-gram language model as a weighted automaton.7 Proper smoothing is not required, but the automaton does have to accept all strings. As discussed at the end of Chapter 4, this could cause exponential blowup in the size of the automaton; to avoid this, we suggest overriding get_transitions to create transitions as needed. Similarly, override get_probs to assign probabilities to the created transitions—again, proper smoothing is not required, but the probabilities should probably be something other than 0 or NaN!
  2. Try values of $m$ from 2 to 10 and, for each, report your accuracy on the development set.1. Then, run your best model on the test set and report your accuracy,1 which should be at least 50%.1

3. RNN language model

This part is optional for CDT 40310 students, who automatically get full credit. For this part, youll need to have NumPy installed. (If you're using a language other than Python, talk to the instructor about converting rnn.npy into a format you can use.)

Now we will try using a neural language model. Training neural models can be laborious both for you and the computer (especially if you don't have access to a GPU), so we'll just use a pre-trained model. You are welcome to train your own if you want to.

  1. Write code to read in data/vocab, which just consists of one symbol per line. (Note that space is one of the symbols, so strip off newlines but not spaces.) The first line is symbol 0, the second line symbol 1, and so on. Write code to read in the development/test data and replace all unknown words with the number for <unk>.1
  2. Write code to read in the model from data/model.npy using numpy.load. You will read five objects, corresponding to $\mathbf{A}^\top$, $\mathbf{B}^\top$, $\mathbf{c}$, $\mathbf{D}^\top$, and $\mathbf{e}$ on page 28 of the notes.1 (To get $\mathbf{A}, \mathbf{B}, \mathbf{D}$ as in the notes, you'll need to transpose them; sorry for this inconvenience.)
  3. Write code that, for each character position in the development/test data, predicts the most probable character given all previous correct characters.6 [Added 09/07: Your RNN should still use the same interface as above: get_start() computes $\mathbf{h}^{(1)}$, get_transitions($\mathbf{h}^{(i-1)}$, a) creates a transition to state $\mathbf{h}^{(i)}$, and get_probs returns its probability $[\mathbf{y}^{(i-1)}]_a$. It may be convenient to store transition t's probability as t.prob.] Report the accuracy on the test data,1 which should be about 50%.1


Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file netid-hw1-part.tgz where part is the part (1, 2, or 3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!

Your submission should contain: