CSE 40657/60657: Natural Language Processing

Setup

Clone the Homework 1 repository. It contains the following files:

`english/train`	English training data
`english/dev`	English development data
`english/test`	English test data
`chinese/train.han`	Chinese training data
`chinese/dev.pin`	Chinese development data: input (pronunciations)
`chinese/dev.han`	Chinese development data: correct output (characters)
`chinese/test.pin`	Chinese test data: input (pronunciations)
`chinese/test.han`	Chinese test data: correct output (characters)
`chinese/charmap`	Chinese character/pronunciation map
`keyboard.py`	GUI demo
`unigram.py`	Unigram language model

The data files come from a corpus of Ubuntu technical support IRC logs. The file chinese/charmap is derived from the Unicode Unihan database.

Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.

In the following, point values are written after each requirement, like this. [30]

1. Baseline

The language models that you implement should have the same interface as the Unigram class in unigram.py. They should support the following operations:

train(filename): Train the model on the text file named filename (e.g., english/train).
start(): Forget about all previously-read characters.
read(c): Read in character c.
prob(c): Return the probability of c given all previously-read characters.

(If you're writing in Python, then for fun, you can try plugging your language models into keyboard.py, which shows a keyboard whose keys grow and shrink depending on their current probability.)
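To make the interface concrete, here is a minimal sketch of a character unigram model that supports the four operations. It is not the contents of unigram.py: the class name `ToyUnigram` and its internals are assumptions made only for illustration.

```python
import collections


class ToyUnigram:
    """Hypothetical stand-in showing the four required operations."""

    def __init__(self):
        self.counts = collections.Counter()
        self.total = 0

    def train(self, filename):
        # Count every character in the file, including newlines.
        with open(filename, encoding="utf-8") as f:
            for line in f:
                for c in line:
                    self.counts[c] += 1
                    self.total += 1

    def start(self):
        # A unigram model keeps no history, so there is nothing to forget.
        pass

    def read(self, c):
        # Likewise, reading a character does not change a unigram model.
        pass

    def prob(self, c):
        # Relative frequency; unseen characters get probability 0.
        return self.counts[c] / self.total if self.total else 0.0
```

A model with longer context would use start() and read(c) to maintain its history, while prob(c) would consult that history.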

Write a program that trains the Unigram model on the training data (english/train) and then, for each character position in another file (english/dev or english/test), predicts the most probable character according to the model, given all previous correct characters. [5] (Since the model is already implemented for you, there's not much to do here.) Report the accuracy (the percentage of predictions that are correct) on the development set. It should be about 15%. [1]
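The evaluation loop described above might look roughly like the following sketch. The function name `accuracy`, the explicit `alphabet` argument, and the choice to call start() at the beginning of every line are assumptions, not requirements of the assignment.

```python
def accuracy(model, lines, alphabet):
    """Fraction of character positions where the model's top guess is right.

    Assumptions: `model` has start/read/prob as described above, `lines` is
    an iterable of strings, and `alphabet` lists the characters to consider.
    """
    correct = total = 0
    for line in lines:
        model.start()  # assumed: history is reset at the start of each line
        for c in line:
            # Pick the candidate character with the highest probability.
            guess = max(alphabet, key=model.prob)
            correct += (guess == c)
            total += 1
            model.read(c)  # feed the correct character, not the guess
    return correct / total if total else 0.0
```

One way to obtain `alphabet` is to collect the set of characters seen in the training data.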

2. English

In this part, you'll try to improve the quality of your character predictor.

(a) Implement a 5-gram language model. [5] Smoothing is not required (yet), but as you design your data structures, keep in mind that you'll want the flexibility to experiment with smoothing and different n-gram sizes later. Report your accuracy on the development set. [1] It should be at least 50%. [1]
(b) Now try to make your language model better. Briefly describe what modifications you tried and, for each, what the accuracy on the development set was. You must try at least one modification, and the accuracy on the development set should be better than the accuracy in part (a). [5]
(c) Then, run your best model on the test set and report your accuracy. To get the final point for this part, you must get at least 60%. [1]
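As one possible starting point for the data structures, the sketch below stores counts in a dictionary keyed by context, with n as a parameter. The class name `ToyNgram`, the helper `train_lines`, and the padding symbol `PAD` are all assumptions for this example, and the model is deliberately unsmoothed.

```python
import collections

PAD = "\x02"  # assumed start-of-line padding symbol, not part of the data


class ToyNgram:
    """Hypothetical unsmoothed character n-gram model (n is configurable)."""

    def __init__(self, n=5):
        self.n = n
        # counts[context][c] = number of times c followed context
        self.counts = collections.defaultdict(collections.Counter)
        self.history = ""

    def train_lines(self, lines):
        for line in lines:
            padded = PAD * (self.n - 1) + line
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.counts[context][padded[i]] += 1

    def train(self, filename):
        with open(filename, encoding="utf-8") as f:
            self.train_lines(f)

    def start(self):
        self.history = PAD * (self.n - 1)

    def read(self, c):
        # Keep only the last n-1 characters as context.
        self.history = (self.history + c)[-(self.n - 1):] if self.n > 1 else ""

    def prob(self, c):
        seen = self.counts.get(self.history)
        if not seen:
            return 0.0  # unseen context: no smoothing in this sketch
        return seen[c] / sum(seen.values())
```

Because counts are keyed by context, the same structure could also hold counts for shorter contexts, which is one way to leave room for backoff or interpolation experiments later.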

3. Chinese

(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users type Chinese using a standard QWERTY keyboard to type the pronunciation of each character in Pinyin. This became practical with the advent of statistical models that automatically guess what the right character is.

For this part, you will write a program that can read in pronunciations (simulating what a user might type) and predict what the correct Chinese characters are.

You are given four files:

A training file (chinese/train.han), written in Chinese characters, which you will train your language model on.
A character map (chinese/charmap) in the following format:
```
㐀 qiu
㐁 tian
㐄 kua
```
Each line has exactly two whitespace-separated columns. The first column is the Chinese character, and the second column is the pronunciation.
An input file (chinese/dev.pin or chinese/test.pin) in the following format:
```
o f a n , <space> ni zai yong shen me v p s ya ?
```
(which means, "ofan, what vps [virtual private server] are you using?") This is a simulation of what a user might type. For each whitespace-separated input token, you have to guess what character the user meant to type (given the correct previous characters). Each input token could be one of:
1. A pronunciation (like ni): For every line in the character map whose second field matches the pronunciation, the first field gives a possible output character.
2. A single character (like o): In this case, the Latin character itself is also a possible output character.
3. <space>: In this case, the output character is always a space.
A file of correct outputs (chinese/dev.han or chinese/test.han), which you will compare your predictions against.
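Reading the character map could be sketched as follows. The function name `load_charmap` is an assumption, and grouping characters by pronunciation is just one convenient layout for the lookups described above.

```python
import collections


def load_charmap(lines):
    """Build a pronunciation -> set of characters mapping.

    `lines` is any iterable of strings in the two-column charmap format,
    for example an open file. The function name is an assumption.
    """
    by_pron = collections.defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines defensively
        char, pron = fields
        by_pron[pron].add(char)
    return by_pron
```

It could be called on the opened file, for example `with open("chinese/charmap", encoding="utf-8") as f: charmap = load_charmap(f)`.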

Write a function candidates(token) that, given an input token token, returns a list of possible characters that the token could specify. It should be the union of the characters in categories (1), (2), and (3) above. [3]
Write code that reads in tokens from chinese/dev.pin or chinese/test.pin and, for each token token, uses the language model to predict the character from candidates(token) that has the highest probability. [2]
Using a bigram language model, run your predictor on the development set and try to improve it (an accuracy of 90% is good). Describe your modifications and, for each, what the accuracy on the development set was. [5]
Finally, run your predictor on the test set and report the accuracy. To get the final point for this part, you must get at least 87%. [1]
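A rough sketch of the candidate lookup and the per-line prediction loop is given below. Unlike the one-argument candidates(token) asked for above, this version takes the character map as an explicit `charmap` argument (assumed to map each pronunciation to a collection of characters); that signature and the helper name `predict_line` are assumptions made for this example.

```python
def candidates(token, charmap):
    """Possible output characters for one input token.

    `charmap` is assumed to map each pronunciation to a collection of
    characters; passing it as an argument is a choice made for this sketch.
    """
    if token == "<space>":
        return [" "]
    result = set(charmap.get(token, ()))
    if len(token) == 1:
        result.add(token)  # a single character may stand for itself
    return sorted(result)


def predict_line(model, tokens, answers, charmap):
    """Guess a character for each token, conditioning on the correct history."""
    guesses = []
    model.start()
    for token, answer in zip(tokens, answers):
        options = candidates(token, charmap)
        # None marks a token with no known candidate.
        guesses.append(max(options, key=model.prob) if options else None)
        model.read(answer)  # read the correct character, as the task specifies
    return guesses
```

Note that a token such as `a` is both a single Latin character and a valid pronunciation, so its candidates include the letter itself as well as any characters the map lists for it.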

Submission

Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file netid-hw1-part.tgz where part is the part (1, 2, or 3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!

Your submission should contain:

A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
All of the code that you wrote.
A README file with instructions on how to build and run your code.
