CSE 40/60657: Natural Language Processing

Setup

Download the Homework 2 files [tgz] [zip]. The archive contains the following files:

`english/train`	English training data
`english/dev`	English development data
`english/test`	English test data
`chinese/train.han`	Chinese training data
`chinese/dev.pin`	Chinese development data: input (pronunciations)
`chinese/dev.han`	Chinese development data: correct output (characters)
`chinese/test.pin`	Chinese test data: input (pronunciations)
`chinese/test.han`	Chinese test data: correct output (characters)
`chinese/charmap`	Chinese character/pronunciation map
`keyboard.py`	GUI demo

The data files come from a corpus of Ubuntu's tech support IRC. The file chinese/charmap is derived from the Unicode Unihan database.

Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.
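
As a small illustration of the note above, here is one way to read a UTF-8 file as Unicode characters in Python 3. The helper name `read_lines` and the throwaway demo file are assumptions made for this sketch; they are not part of the homework files.

```python
import os
import tempfile

def read_lines(path):
    """Read a text file as Unicode strings, one per line, without the newline."""
    # Passing encoding explicitly avoids depending on the platform default.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Demo on a temporary file rather than the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.han")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("你好\n")
    demo_lines = read_lines(demo_path)

# Two characters, but six bytes once encoded as UTF-8.
demo_chars = len(demo_lines[0])
demo_bytes = len(demo_lines[0].encode("utf-8"))
```

Iterating over each string in `demo_lines` then yields one Chinese character at a time, which is the unit the language model should count.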

In the following, point values are written after each requirement, like this.30

1. Implementation

Implement a character-based language model using Witten-Bell smoothing, as described in Sections 5.3.2 and 5.5 of the notes (in particular, Equation 5.14). You're welcome to try a different kind of model instead, like Kneser-Ney smoothing. However, Witten-Bell is sufficient for this task. (I really wanted to recommend neural networks as another option, but unfortunately they don't seem to work well under the constraints of this assignment.)

It's recommended, but not required, that your language model have the same interface as class Uniform in keyboard.py. (If it does, then, for fun, you can try plugging your language model into that script, which shows a keyboard whose keys grow and shrink depending on their current probability.)

In any case, your code should be able to:

Train the model on a text file (e.g., english/train).3
Compute the probability of a character, given the previous characters.3

Briefly describe what kind of language model you implemented along with any relevant implementation details.4
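
To make the two requirements above concrete, here is a minimal sketch of a character n-gram model with Witten-Bell interpolation. It uses the common textbook recursion $P(c \mid u) = \frac{c(uc) + n_{1+}(u)\,P(c \mid u')}{c(u) + n_{1+}(u)}$, where $n_{1+}(u)$ is the number of distinct characters seen after context $u$ and $u'$ is $u$ with its first character dropped; this may differ in detail from Equation 5.14 of the notes. The class name `WittenBellLM`, the methods `train` and `prob`, and the end-of-line symbol `</s>` are all assumptions for illustration, and this is not necessarily the interface of class Uniform in keyboard.py.

```python
from collections import defaultdict

class WittenBellLM:
    """Character n-gram model with Witten-Bell interpolation (illustrative sketch)."""

    def __init__(self, order=3):
        self.order = order  # an order-n model conditions on up to n-1 characters
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
        self.totals = defaultdict(int)                       # context -> total count
        self.vocab = set()

    def train(self, lines):
        for line in lines:
            chars = list(line) + ["</s>"]  # hypothetical end-of-line symbol
            self.vocab.update(chars)
            for i, c in enumerate(chars):
                # Count c after every context length available at position i.
                for n in range(min(self.order, i + 1)):
                    ctx = tuple(chars[i - n:i])
                    self.counts[ctx][c] += 1
                    self.totals[ctx] += 1

    def prob(self, context, char):
        context = tuple(context)
        context = context[-(self.order - 1):] if self.order > 1 else ()
        p = 1.0 / len(self.vocab)  # base case: uniform over the vocabulary
        # Interpolate from the empty context up to the longest seen context.
        for n in range(len(context) + 1):
            ctx = context[len(context) - n:]
            total = self.totals.get(ctx, 0)
            if total == 0:
                break  # unseen context, so every longer context is unseen too
            followers = self.counts[ctx]
            types = len(followers)
            p = (followers.get(char, 0) + types * p) / (total + types)
        return p

# Toy usage on two made-up lines.
lm = WittenBellLM(order=2)
lm.train(["abab", "abba"])
p_b_given_a = lm.prob("a", "b")
p_a_given_a = lm.prob("a", "a")
```

In this sketch, contexts at the start of a line are simply shorter rather than padded with a start symbol, and a character never seen in training receives the same uniform base probability as a seen one, so the distribution sums to one only over the training vocabulary. Both are simplifications you may want to revisit.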

2. English

In this part, you'll evaluate your language model on English, by reading in text (simulating what a user might type) and predicting what each character would be.

Write a program to do this. It should read in a file, and for each character position in the file, it should predict the most probable character (given the previous characters) according to the model.2
Report the most probable characters, with their probabilities, for the first ten positions in the development set.1
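Report your model's perplexity on the test set.2 Perplexity is defined as $\textrm{ppl} = \exp\left(-\frac{1}{N}\sum_{i=1}^N \log P(w_i)\right)$, where $i$ ranges over all the character positions in the test set and $P(w_i)$ is the probability your model assigns to the correct character $w_i$ given the previous characters. It should be greater than one.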
Report what percent of the predictions were correct on the test set. For full credit, you should get at least 60%.5
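
The evaluation steps above can be sketched as three small functions. They assume a model exposed as a callable `prob(context, char)`; that interface, the function names, and the toy model below are hypothetical and only serve to exercise the code.

```python
import math

def predict(prob, context, vocab):
    """Return the most probable next character among vocab."""
    return max(vocab, key=lambda c: prob(context, c))

def perplexity(probs):
    """exp of the negative mean log-probability of the correct characters."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def accuracy(predicted, correct):
    """Fraction of positions where the prediction matches the correct character."""
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)

# Toy model that ignores the context entirely (for illustration only).
def toy_prob(context, char):
    return {"a": 0.5, "b": 0.3, "c": 0.2}.get(char, 0.0)

toy_prediction = predict(toy_prob, "", "abc")
uniform_ppl = perplexity([0.25] * 8)   # uniform over 4 characters
toy_accuracy = accuracy("abcd", "abxd")
```

A useful sanity check is that a model assigning uniform probability over $k$ characters has perplexity $k$, which is what `uniform_ppl` computes for $k = 4$.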

3. Chinese

(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users type Chinese on a standard QWERTY keyboard by entering the pronunciation of each character in Pinyin. This became possible with the advent of statistical models that automatically guess which character was intended.

For this part, you will write a program that can read in pronunciations (simulating what a user might type) and predict what the correct Chinese characters are.

You'll be given four files:

A training file (chinese/train.han), which you will train your model on.
A character map (chinese/charmap) in the following format:
```
㐀 qiu
㐁 tian
㐄 kua
```
Each line has exactly two whitespace-separated columns. The first column is the Chinese character, and the second column is the pronunciation.
An input file (chinese/dev.pin or chinese/test.pin) in the following format:
```
o f a n , <space> ni zai yong shen me v p s ya ?
```
(which means, "ofan, what vps [virtual private server] are you using?") This is a simulation of what a user might type. For each whitespace-separated token, you have to guess what character the user meant to type (given the correct previous characters). Each token could be:
1. a pronunciation (like ni), which can convert to one of the Chinese characters listed in the character map,
2. a single character (like o), which can convert to itself, or
3. <space>, which converts to a space.
A file of correct outputs (chinese/dev.han or chinese/test.han), which you will compare your predictions against.
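
The file formats above suggest inverting the character map into a lookup from pronunciation to characters, and then listing the candidates for each input token. The sketch below does this; the function names `read_charmap` and `candidates` are hypothetical, and taking the union of categories 1 and 2 for a one-character token (which could be either a pronunciation or a literal character) is an assumption, not something the assignment states.

```python
from collections import defaultdict

def read_charmap(lines):
    """Map each pronunciation to the set of characters that can have it."""
    charmap = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        char, pron = fields
        charmap[pron].add(char)
    return charmap

def candidates(token, charmap):
    """Characters that an input token may convert to."""
    if token == "<space>":
        return {" "}
    cands = set(charmap.get(token, ()))  # category 1: a pronunciation
    if len(token) == 1:
        cands.add(token)                 # category 2: a single character
    return cands

# The three sample lines shown in the charmap format above.
sample = ["㐀 qiu", "㐁 tian", "㐄 kua"]
cm = read_charmap(sample)
qiu_cands = candidates("qiu", cm)
space_cands = candidates("<space>", cm)
letter_cands = candidates("o", cm)
```

With the real file, you would pass the lines of chinese/charmap instead of `sample`; `len(cm["yi"])` would then give the count of characters for the pronunciation yi.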

Write code to read in the character map. Report how many characters are possible for the pronunciation yi.2
Write code to use your model together with the input file and character map to make predictions, like in Part 2, but constrained to the three categories (1, 2, and 3) listed above. Report the most probable characters, with their probabilities, for the first ten positions in the development set.3
Report the accuracy of your predictions (for all characters) on the test set. For full credit, you should get at least 87%.5
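
One possible shape for the constrained prediction step is sketched below: score only the allowed candidates for each token, pick the best, and always condition on the correct previous characters. The callable `prob(context, char)`, the function names, and the toy probabilities are hypothetical. Renormalizing the reported probability over the candidate set is also an assumption; reporting the unnormalized model probability is an equally reasonable reading.

```python
def predict_constrained(prob, context, cands):
    """Return (best_char, probability renormalized over the candidates)."""
    scores = {c: prob(context, c) for c in cands}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, (scores[best] / total if total > 0 else 0.0)

def decode(prob, cands_per_token, gold):
    """Predict each position given the correct previous characters."""
    predictions = []
    for i, cands in enumerate(cands_per_token):
        best, _ = predict_constrained(prob, gold[:i], cands)
        predictions.append(best)
    return predictions

# Toy model with made-up probabilities that ignores the context.
def toy_prob(context, char):
    return {"你": 0.02, "尼": 0.001, "x": 0.1}.get(char, 1e-6)

best_char, best_prob = predict_constrained(toy_prob, "", {"你", "尼"})
toy_predictions = decode(toy_prob, [{"x"}, {"你", "尼"}], "x你")
```

Note that `x` has the highest probability under the toy model but is never predicted for the second token, because it is not among that token's candidates.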

Submission

Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. Please name your file hw2-netid.tgz.

A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
All of the code that you wrote.
A README file with instructions on how to build and run your code.
