CSE 40657/60657: Natural Language Processing

Setup

Clone the Homework 1 repository. It contains the following files:

`data/train`	training data
`data/dev`	development data
`data/test`	test data
`keyboard.py`	GUI demo
`fst.py`	FST library
`unigram.py`	Unigram language model
`rnn/model.npy`	RNN parameters
`rnn/vocab`	vocabulary for RNN

The data files come from a corpus of Ubuntu's tech support IRC.
fst.py` is a small library for finite automata and transducers. For more information, please see the notebook "Using the fst Module": [static] [interactive].

In the following, point values are written after each requirement, like this.30

1. Baseline

All the language models that you build for this assignment should support the following operations. If the model object is called m, then m should have the following methods:

m.get_start(): Get start state.
m.get_transitions(q, c): Return a list of transitions (instances of fst.Transition) that are possible from state q when the next input symbol is c.
m.get_prob(t): Return the probability of transition t.

For fun, you can try plugging your language models into keyboard.py, which shows a keyboard whose keys grow and shrink depending on their current probability.

The file unigram.py provides a class Unigram that extends fst.FST. It inherits the first two methods of the above interface from fst.FST and adds the third method.

Write a program that uses unigram.Unigram to train a unigram model on the training data (data/train).1 Then, for each character position in the development data (data/dev), it should predict the most probable character given all previous correct characters.7 Report the accuracy (what percent of the predictions are correct) on the development data.1 It should be about 15%.1

2. m-gram language model

In this part, you'll try to improve the quality of your character predictor.

Implement an $m$-gram language model as a weighted automaton.7 Proper smoothing is not required, but the automaton does have to accept all strings. As discussed at the end of Chapter 4, this could cause exponential blowup in the size of the automaton; to avoid this, we suggest overriding get_transitions to create transitions as needed. Similarly, override get_probs to assign probabilities to the created transitions—again, proper smoothing is not required, but the probabilities should probably be something other than 0 or NaN!
Try values of $m$ from 2 to 10 and, for each, report your accuracy on the development set.1. Then, run your best model on the test set and report your accuracy,1 which should be at least 50%.1

3. RNN language model

This part is optional for CDT 40310 students, who automatically get full credit. For this part, youll need to have NumPy installed. (If you're using a language other than Python, talk to the instructor about converting rnn.npy into a format you can use.)

Now we will try using a neural language model. Training neural models can be laborious both for you and the computer (especially if you don't have access to a GPU), so we'll just use a pre-trained model. You are welcome to train your own if you want to.

Write code to read in data/vocab, which just consists of one symbol per line. (Note that space is one of the symbols, so strip off newlines but not spaces.) The first line is symbol 0, the second line symbol 1, and so on. Write code to read in the development/test data and replace all unknown words with the number for <unk>.1
Write code to read in the model from data/model.npy using numpy.load. You will read five objects, corresponding to $\mathbf{A}^\top$, $\mathbf{B}^\top$, $\mathbf{c}$, $\mathbf{D}^\top$, and $\mathbf{e}$ on page 28 of the notes.1 (To get $\mathbf{A}, \mathbf{B}, \mathbf{D}$ as in the notes, you'll need to transpose them; sorry for this inconvenience.)
Write code that, for each character position in the development/test data, predicts the most probable character given all previous correct characters.6 [Added 09/07: Your RNN should still use the same interface as above: get_start() computes $\mathbf{h}^{(1)}$, get_transitions($\mathbf{h}^{(i-1)}$, a) creates a transition to state $\mathbf{h}^{(i)}$, and get_probs returns its probability $[\mathbf{y}^{(i-1)}]_a$. It may be convenient to store transition t's probability as t.prob.] Report the accuracy on the test data,1 which should be about 50%.1

Submission

Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file netid-hw1-part.tgz where part is the part (1, 2, or 3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!

Your submission should contain:

A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
All of the code that you wrote.
A README file with instructions on how to build/run your code.

CSE 40657/60657 Homework 1

Setup

1. Baseline

2. m-gram language model

3. RNN language model

Submission

CSE 40657/60657
Homework 1