Menu

CSE 40657/60657
Homework 2

Due
Week of 2016/09/26
Points
30

In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.

Setup

Download the Homework 2 files [tgz] [zip]. It contains the following files:

english/trainEnglish training data
english/devEnglish development data
english/testEnglish test data
chinese/train.hanChinese training data
chinese/dev.pinChinese development data: inputs
chinese/dev.hanChinese development data: correct outputs
chinese/test.pinChinese test data: input (pronunciations)
chinese/test.hanChinese test data: correct output (characters)
chinese/charmapChinese character/pronunciation map
keyboard.pyGUI demo
The data files come from a corpus of Ubuntu's tech support IRC. The file chinese/charmap is derived from the Unicode Unihan database.

Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.

In the following, point values are written after each requirement, like this.30

1. Implementation

Implement a character-based language model using Witten-Bell smoothing, as described in Sections 5.3.2 and 5.5 of the notes (in particular, Equation 5.14). You're welcome to try a different kind of model instead, like Kneser-Ney smoothing. However, Witten-Bell is sufficient for this task. (I really wanted to recommend neural networks as another option, but unfortunately they don't seem to work well under the constraints of this assignment.)

It's recommended, but not required, that your language model have the same interface as class Uniform in keyboard.py. (If it does, then, for fun, you can try plugging your language model into that script, which shows a keyboard whose keys grow and shrink depending on their current probability.)

In any case, your code should be able to:

Briefly describe what kind of language model you implemented along with any relevant implementation details.4

2. English

In this part, you'll evaluate your language model on English, by reading in text (simulating what a user might type) and predicting what each character would be.

  1. Write a program to do this. It should read in a file, and for each character position in the file, it should predict the most probable character (given the previous characters) according to the model.2
  2. Report the most probable characters, with their probabilities, for the first ten positions in the development set.1
  3. Report your model's perplexity on the test set.2 It should be greater than one: $\textrm{ppl} = \exp -\frac1N\sum_{i=1}^N \log P(w_i)$, where $i$ ranges over all the character positions in the test set.
  4. Report what percent of the predictions were correct on the test set. For full credit, you should get at least 60%.5

3. Chinese

(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users type Chinese using a standard QWERTY keyboard to type the pronunciation of each character in Pinyin. The reason this became possible is the advent of statistical models that automatically guess what the right character is.

For this part, you will write a program that can read in pronunciations (simulating what a user might type) and predicts what the correct Chinese characters are.

You'll be given four files:

  1. Write code to read in the character map. Report how many characters are possible for the pronunciation yi.2
  2. Write code to use your model together with the input file and character map to make predictions, like in Part 2, but constrained to the three categories (i, ii, and iii) listed above. Report the most probable characters, with their probabilities, for the first ten positions in the development set.3
  3. Report the accuracy of your predictions (for all characters) on the test set. For full credit, you should get at least 87%.5

Submission

Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. Please name your file hw2-netid.tgz.