In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. It contains the following files:
| `data/train` | training data |
| `data/dev` | development data |
| `data/test` | test data |
| `predict.py` | text-prediction demo |
| `unigram.py` | unigram language model |
The data files come from the NUS SMS Corpus, a collection of real text messages sent mostly by students at the National University of Singapore. This is the English portion of the corpus, though it has a lot of interesting examples of Singlish, which mixes in elements of Malay and various Chinese languages.
In the following, point values are written after each requirement, like this.^30
All the language models that you build for this assignment should support the following operations. If the model object is called `m`, then `m` should have the following methods:

- `m.start()`: Return the start state.
- `m.input(q, a)`: Return the state that the model would be in after it was in state `q` and read input symbol `a`.
- `m.best(q)`: Return the symbol with the highest model probability in state `q`.

These operations are used by `predict.py`, which predicts the next 20 characters based on what you've typed so far.
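As a minimal sketch of this interface, here is a hypothetical uniform model (the class name and tie-breaking rule are illustrative assumptions, not part of the assignment):

```python
class UniformModel:
    """Hypothetical model implementing the required interface:
    start(), input(q, a), and best(q)."""

    def __init__(self, data):
        # data: a list of lists of characters
        self.vocab = sorted({a for line in data for a in line})

    def start(self):
        # A uniform model is stateless, so any constant works as the state.
        return None

    def input(self, q, a):
        # Return the state after being in state q and reading symbol a.
        return q

    def best(self, q):
        # Return the symbol with the highest probability in state q.
        # Here all symbols are equally likely, so break ties alphabetically.
        return self.vocab[0]

m = UniformModel([list("hello"), list("world")])
q = m.start()
q = m.input(q, 'h')
print(m.best(q))  # → 'd' (alphabetically first character seen in training)
```

The state-machine framing (`start`/`input`/`best`) is what lets `predict.py` feed in your keystrokes one at a time and query the next character after each one.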
The file `unigram.py` provides a class `Unigram` that implements the above interface using a unigram language model. The `Unigram` constructor expects a list of lists of characters to train on.

- Write a program that reads in the training data (`data/train`; be sure to strip off trailing newlines) and uses `unigram.Unigram` to train a unigram model.^3
- For each character position in the development data (`data/dev`), predict the most probable character given all previous correct characters.^5
- Report the accuracy (what percent of the predictions are correct),^1 which should be about 16.5%.^1
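The evaluation described above can be sketched as follows. The `Unigram` class below is a stand-in for the provided `unigram.Unigram` (the real one is in your repository; only the interface is assumed here), and `accuracy` shows the prediction loop:

```python
from collections import Counter

class Unigram:
    """Stand-in for the provided unigram.Unigram, assuming the same
    constructor (a list of lists of characters) and interface."""

    def __init__(self, data):
        counts = Counter(a for line in data for a in line)
        self.guess = counts.most_common(1)[0][0]

    def start(self):
        return None

    def input(self, q, a):
        return q

    def best(self, q):
        return self.guess

def accuracy(m, data):
    """Fraction of positions where m.best predicts the next character,
    given all previous *correct* characters."""
    correct = total = 0
    for line in data:
        q = m.start()
        for a in line:
            correct += (m.best(q) == a)
            total += 1
            q = m.input(q, a)  # feed the true character, not the guess
    return correct / total

# Usage with the real files (note the newline stripping):
# train = [list(line.rstrip('\n')) for line in open('data/train')]
# dev   = [list(line.rstrip('\n')) for line in open('data/dev')]
# print(accuracy(unigram.Unigram(train), dev))
```

Note that after each prediction the model is advanced with the *correct* character, so one wrong guess doesn't derail the rest of the line.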
Try running `python predict.py data/train`. By default, it uses a unigram language model, which is not very interesting, because the model always predicts a space. (Nothing to report here.)
In this part, you'll replace the unigram language model with a 5-gram model.
- Implement a 5-gram language model with the same interface and train it on `data/train`. Report your accuracy on `data/dev`, which should be at least 37%.^1
- Run your model on the test data (`data/test`) and report your accuracy, which should be at least 42%.^1
- Try out `predict.py` with your model. (Nothing to report.)
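One possible shape for an n-gram model under this interface is sketched below: the state is the last n−1 characters, and unseen contexts fall back to overall character counts. The class name, padding symbol, and fallback strategy are all assumptions, not requirements:

```python
from collections import Counter, defaultdict

class Ngram:
    """Sketch of an n-gram character model (n=5 for this assignment)
    implementing start(), input(q, a), and best(q)."""

    def __init__(self, n, data):
        self.n = n
        self.counts = defaultdict(Counter)
        for line in data:
            # Pad each line with beginning-of-sequence symbols.
            chars = ['<BOS>'] * (n - 1) + list(line)
            for i in range(n - 1, len(chars)):
                context = tuple(chars[i - n + 1:i])
                self.counts[context][chars[i]] += 1
        # Fallback for contexts never seen in training.
        self.fallback = Counter(a for line in data for a in line)

    def start(self):
        return ('<BOS>',) * (self.n - 1)

    def input(self, q, a):
        # The state is the most recent n-1 symbols.
        return (q + (a,))[1:]

    def best(self, q):
        dist = self.counts.get(q)
        if not dist:
            dist = self.fallback
        return dist.most_common(1)[0][0]
```

Because the state is just a tuple of recent characters, `input` is O(1) and the model plugs directly into the same evaluation loop used for the unigram model.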
Now we will try building a neural language model using PyTorch. (You can use a different framework if you want, but PyTorch is recommended.)
- Train your model on `data/train` and validate on `data/dev`. Save the model after every epoch and submit your best model.^1 For us, each epoch took about 15 minutes, and 12 or 13 epochs were usually enough to reach a dev accuracy over 40%.
- Run your model on the test data (`data/test`) and report your accuracy,^1 which should be at least 45%.^1
- Try out `predict.py` with your model. Because training takes a while, you'll probably want to load a trained model from disk. (Nothing to report.)
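One way the same interface might wrap a recurrent network in PyTorch is sketched below; the class name, the `LSTMCell` choice, and the sizes are all assumptions, and the training loop is omitted (a trained model would be saved with `torch.save` and restored with `torch.load` or `load_state_dict`):

```python
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    """Sketch of an RNN character model exposing start/input/best.
    The state q is the LSTM's (hidden, cell) pair."""

    def __init__(self, vocab, hidden_size=128):
        super().__init__()
        self.vocab = sorted(vocab)
        self.index = {a: i for i, a in enumerate(self.vocab)}
        self.emb = nn.Embedding(len(self.vocab), hidden_size)
        self.rnn = nn.LSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, len(self.vocab))

    def start(self):
        h = torch.zeros(1, self.rnn.hidden_size)
        return (h, torch.zeros_like(h))

    def input(self, q, a):
        # Embed one character and advance the recurrent state.
        x = self.emb(torch.tensor([self.index[a]]))
        return self.rnn(x, q)

    def best(self, q):
        h, _ = q
        logits = self.out(h)
        return self.vocab[logits.argmax().item()]
```

Keeping the recurrent state inside `q` (rather than inside the module) is what makes the model fit the assignment's state-machine interface: `predict.py` can hold several candidate states at once without the model tracking any of them.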
Please read these submission instructions carefully.