# CSE 40657/60657 Homework 1

• Due: 2021/02/26 at 5pm
• Points: 30

In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.

## Setup

Clone the Homework 1 repository. It contains the following files:

• data/train: training data
• data/minitrain: smaller training data
• data/dev: development data
• data/test: test data
• predict.py: text-prediction demo
• unigram.py: unigram language model

The data files come from the NUS SMS Corpus, a collection of real text messages sent mostly by students at the National University of Singapore. This is the English portion of the corpus, though it has a lot of interesting examples of Singlish, which mixes in elements of Malay and various Chinese languages.

In the following, the point value of each requirement is written after it, like this.30

All the language models that you build for this assignment should support the following operations. If the model object is called m, then m should have the following methods:

• m.start(): Return the start state.
• m.read(q, a): Return the state that the model would be in after it was in state q and read symbol a.
• m.logprob(q, a): Return the model's log-probability (base e) of reading symbol a in state q.
• m.best(q): Return the symbol with the highest model probability in state q.
A model implementing this interface can be plugged into predict.py, which predicts the next 20 characters based on what you've typed so far.
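As a concrete illustration, here is a minimal sketch of the required interface, using a character unigram model as the example. The method names (start, read, logprob, best) come from the assignment; the class name and internals here are assumptions, not the provided unigram.py.

```python
import collections
import math

class CharUnigram:
    """Minimal sketch of the required language-model interface."""

    def __init__(self, data):
        # data: iterable of strings (lines of training text)
        counts = collections.Counter(c for line in data for c in line)
        total = sum(counts.values())
        self.logprobs = {a: math.log(n / total) for a, n in counts.items()}

    def start(self):
        return None            # a unigram model has no real state

    def read(self, q, a):
        return q               # the state never changes

    def logprob(self, q, a):
        return self.logprobs.get(a, float('-inf'))

    def best(self, q):
        return max(self.logprobs, key=self.logprobs.get)
```

Any object with these four methods can be plugged into predict.py in place of the provided model.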

## 1. Baseline

The file unigram.py provides a class Unigram that implements the above interface using a unigram language model. The Unigram constructor expects a list of lists of words to train on.

1. Write a program that reads in the training data (data/train; be sure to strip off trailing newlines) and uses unigram.Unigram to train a unigram model.3
2. For each character position in the development data (data/dev), predict the most probable character given all previous correct characters.5 Report the accuracy (what percent of the predictions are correct),1 which should be about 16.5%.1
3. Try running python predict.py data/train. By default, it uses a unigram language model, which is not very interesting, because the model always predicts a space. (Nothing to report here.)
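The evaluation in step 2 can be sketched as the following loop, which works for any model implementing the interface above. The AlwaysA stub is a hypothetical stand-in for your trained model, included only so the sketch is self-contained.

```python
def accuracy(m, lines):
    """Fraction of character positions where m.best predicts the next
    character, given all previous *correct* characters."""
    correct = total = 0
    for line in lines:
        q = m.start()
        for a in line:
            if m.best(q) == a:   # predict before seeing the true character
                correct += 1
            total += 1
            q = m.read(q, a)     # then advance on the correct character
    return correct / total

class AlwaysA:
    """Trivial stub model for demonstration: always predicts 'a'."""
    def start(self): return None
    def read(self, q, a): return None
    def best(self, q): return 'a'
```

With your unigram model in place of the stub, the reported number should be about 16.5% on data/dev.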

## 2. n-gram language model

In this part, you'll replace the unigram language model with a 5-gram model.

1. Implement a 5-gram language model.5 For smoothing, just do something very simple, like add-0.01 smoothing.3
2. Train it on data/train. Report your accuracy on data/dev,1 which should be at least 49%.
3. After you've gotten your model working, run it on the test set (data/test) and report your accuracy, which should be at least 49%.1
4. Try running predict.py with your model. (Nothing to report.)

## 3. RNN language model

Now we will try building a neural language model using PyTorch. (You can use a different framework if you want, but PyTorch is recommended.)

1. To get started with PyTorch, try our tutorial notebook, which trains a unigram language model.
2. Write code to implement an RNN language model.5 You may reuse any code from the tutorial notebook, and you may use any functions provided by PyTorch. As observed in the notes (bottom of page 21), a simple RNN isn't very sensitive to dependencies between the previous state and the input symbol, and it's actually difficult to make a simple RNN do better than 5-grams on a dataset this small. This modification works better (in place of equation 2.22): \begin{align} \mathbf{v}^{(i)} &= \mathbf{B} \mathbf{x}^{(i)} \\ \mathbf{z}_j &= \sum_{k=1}^d \sum_{l=1}^d \mathbf{W}_{jkl} \mathbf{h}^{(i-1)}_k \mathbf{v}^{(i)}_l & j&= 1, \ldots, d\\ \mathbf{h}^{(i)} &= \tanh (\mathbf{z} + \mathbf{c}) \end{align} where $\mathbf{W}$ is a tensor of size $d \times d \times d$. We used $d=64$. (Updated 2/18: changed $\mathbf{e}^{(i)}$ to $\mathbf{v}^{(i)}$ to avoid conflict with the parameter $\mathbf{e}$ from the notes.)
3. Because training can be slow, try training on data/minitrain first, which takes less than a minute per epoch. Report your train perplexity and dev accuracy.1
4. On such a small dataset, training can vary a lot; expect a typical train perplexity of 5-8 and a dev accuracy of 26-33%.
5. Train on the full data/train and validate on data/dev. Warning: be sure to allow enough time for this, and it might be good to save the model after every epoch. For us, each epoch took about 30 minutes, and 10 epochs was enough. Report your final dev accuracy,1 which should be at least 50%.1
6. After you've gotten your model working, run it on the test set (data/test) and report your accuracy,1 which should be at least 51%.1
7. Try running predict.py with your model. Because training takes a while, you'll probably want to load a trained model from disk. (Nothing to report.)
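The modified recurrence in step 2 can be sketched as a PyTorch module. It computes exactly the given equations: v = Bx, z_j = sum over k,l of W_jkl h_k v_l, and h' = tanh(z + c). The class name, initialization scale, and one-hot input convention are assumptions.

```python
import torch

class MultiplicativeRNNCell(torch.nn.Module):
    """One step of the modified RNN from part 3.2 (sketch)."""

    def __init__(self, vocab_size, d=64):
        super().__init__()
        # Initialization scales are assumptions, not from the assignment
        self.B = torch.nn.Parameter(torch.empty(d, vocab_size).uniform_(-0.1, 0.1))
        self.W = torch.nn.Parameter(torch.empty(d, d, d).uniform_(-0.01, 0.01))
        self.c = torch.nn.Parameter(torch.zeros(d))

    def forward(self, h, x):
        # h: previous hidden state (size d); x: one-hot input (size vocab_size)
        v = self.B @ x                                # v^(i) = B x^(i)
        z = torch.einsum('jkl,k,l->j', self.W, h, v)  # z_j = sum_kl W_jkl h_k v_l
        return torch.tanh(z + self.c)                 # h^(i) = tanh(z + c)
```

Note that the einsum contraction is O(d^3) per step, so training is noticeably slower than a simple RNN; this is consistent with the roughly 30-minute epochs reported above.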

## Submission

Please read these submission instructions carefully. This is our first year using this workflow, so apologies in advance for any snags.

1. Create a private GitHub repository.
2. After you complete each part, create a tag with git tag -a part1, git tag -a part2, etc. If you make the final submission late, we'll use these tags to compute the per-part late penalty. (You can also create the tags after the fact, with git tag -a part1 abc123, where abc123 is the commit's checksum.)
3. Push your repository and its tags to GitHub (git push --tags origin HEAD).