CSE 40657/60657
Homework 1

2018/09/14 at 5pm

In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.


Clone the Homework 1 repository. It contains the following files:

english/train       English training data
english/dev         English development data
english/test        English test data
chinese/train.han   Chinese training data
chinese/dev.pin     Chinese development data: inputs
chinese/dev.han     Chinese development data: correct outputs
chinese/test.pin    Chinese test data: input (pronunciations)
chinese/test.han    Chinese test data: correct output (characters)
chinese/charmap     Chinese character/pronunciation map
keyboard.py         GUI demo
unigram.py          Unigram language model

The data files come from a corpus of Ubuntu's tech support IRC. The file chinese/charmap is derived from the Unicode Unihan database.

Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.

In the following, point values are written after each requirement, like this. [30]

1. Baseline

The language models that you implement should have the same interface as the provided Unigram class. They should support the following operations:

(If you're writing in Python, then for fun, you can try plugging your language models into the GUI demo, which shows a keyboard whose keys grow and shrink depending on their current probability.)

Write a program that trains the Unigram model on the training data (english/train) and then, for each character position in another file (english/dev or english/test), predicts the most probable character according to the model, given all previous correct characters. [5] (Since the model is already implemented for you, there's not much to do here.) Report the accuracy (the percentage of predictions that are correct) on the development set. It should be about 15%. [1]
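The evaluation loop described above can be sketched as follows. This is only an illustration: UnigramSketch is a hypothetical stand-in for the provided Unigram class, and its method names (train, start, read, best) are assumptions, not the actual interface. The toy strings stand in for english/train and english/dev.

```python
from collections import Counter


class UnigramSketch:
    """Hypothetical stand-in for the provided Unigram class (interface assumed)."""

    def __init__(self):
        self.counts = Counter()

    def train(self, lines):
        # Count every character in the training lines.
        for line in lines:
            self.counts.update(line)

    def start(self):
        """Begin a new line. A unigram model keeps no state, so nothing to do."""

    def read(self, ch):
        """Consume one correct character. A unigram model ignores it."""

    def best(self):
        """Return the character with the highest count (highest probability)."""
        return self.counts.most_common(1)[0][0]


def accuracy(model, lines):
    """Fraction of character positions where the model's top guess is correct."""
    correct = total = 0
    for line in lines:
        model.start()
        for ch in line:
            if model.best() == ch:
                correct += 1
            total += 1
            model.read(ch)  # condition on the correct character, not the guess
    return correct / total


# Toy data standing in for the real training and development files.
train_lines = ["the cat sat", "the hat"]
dev_lines = ["a tot"]

model = UnigramSketch()
model.train(train_lines)
acc = accuracy(model, dev_lines)
print(f"accuracy: {acc:.1%}")
```

With the real data, the lines would instead be read from the files, for example with open("english/train", encoding="utf-8").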

2. English

In this part, you'll try to improve the quality of your character predictor.

  1. Implement a 5-gram language model. [5] Smoothing is not required (yet), but as you design your data structures, keep in mind that you'll want the flexibility to experiment with smoothing and different n-gram sizes later. Report your accuracy on the development set. [1] It should be at least 50%. [1]
  2. Now try to make your language model better. Briefly describe what modifications you tried, and for each, what the accuracy on the development set was. You must try at least one modification, and the accuracy on the development set should be better than the accuracy in step 1. [5]
  3. Then, run your best model on the test set and report your accuracy. To get the final point for this part, you must get at least 60%. [1]
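One possible data structure for the steps above is a dictionary from context strings to character counts, sketched below. The class name NGramSketch and its methods are hypothetical. The sketch stores counts for every context length from 0 to n-1 and, when predicting, uses the longest context that was seen in training. That fallback is just one simple modification you might try, not a required method, and it applies no smoothing. For simplicity it also does not model the start of a line specially.

```python
from collections import Counter, defaultdict


class NGramSketch:
    """Hypothetical n-gram character model; names and interface are assumptions."""

    def __init__(self, n):
        self.n = n
        # counts[context][ch] = number of times ch followed context in training,
        # for contexts of every length from 0 up to n-1.
        self.counts = defaultdict(Counter)
        self.history = ""

    def train(self, lines):
        for line in lines:
            for i, ch in enumerate(line):
                # k is the context length; it cannot exceed the characters
                # available before position i.
                for k in range(min(self.n - 1, i) + 1):
                    self.counts[line[i - k:i]][ch] += 1

    def start(self):
        self.history = ""

    def read(self, ch):
        # Remember only the last n-1 characters.
        keep = self.n - 1
        self.history = (self.history + ch)[-keep:] if keep > 0 else ""

    def best(self):
        # Try the longest context first, then progressively shorter ones.
        for k in range(len(self.history), -1, -1):
            context = self.history[len(self.history) - k:]
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None  # the model was never trained


model = NGramSketch(3)  # use NGramSketch(5) for a 5-gram model
model.train(["abcabcabd", "b"])

model.start()
for ch in "ab":
    model.read(ch)
seen_guess = model.best()      # context "ab" was seen in training

model.start()
model.read("z")
fallback_guess = model.best()  # "z" was never seen, so the empty context is used
```

Because n is a constructor argument and all shorter contexts are stored, the same structure can be reused when experimenting with different n-gram sizes or with smoothing.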

3. Chinese

(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users type Chinese on a standard QWERTY keyboard by entering the pronunciation of each character in Pinyin. This became possible with the advent of statistical models that automatically guess which character is intended.

For this part, you will write a program that reads in pronunciations (simulating what a user might type) and predicts the correct Chinese characters.

You are given four files:

  1. Write a function candidates(token) that, given an input token token, returns a list of possible characters that the token could specify. It should be the union of the characters in categories (i), (ii), and (iii) above. [3]
  2. Write code that reads in tokens from chinese/ or chinese/ and, for each token token, uses the language model to predict the character from candidates(token) that has the highest probability. [2]
  3. Using a bigram language model, run your predictor on the development set and try to improve it (an accuracy of 90% is good). Describe your modifications and, for each, what the accuracy on the development set was. [5]
  4. Finally, run your predictor on the test set and report the accuracy. To get the final point for this part, you must get at least 87%. [1]
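The way candidates(token) and a bigram model fit together can be sketched as below, under several loudly stated assumptions. CHARMAP is a hypothetical toy dictionary, not the real chinese/charmap, whose file format is not assumed here. The rule that an unmapped token stands for itself is also an assumption and may not match the candidate categories the assignment defines. The sketch conditions each prediction on its own previous prediction; conditioning on the correct previous character instead, as in Part 1, is the other option.

```python
from collections import Counter, defaultdict

# Hypothetical toy map from pronunciation to characters. The real map would be
# built from chinese/charmap.
CHARMAP = {"ni": ["你", "尼"], "hao": ["好", "号"]}

START = "<s>"  # hypothetical start-of-line symbol


def candidates(token):
    """Return the characters that token could specify (simplified)."""
    chars = list(CHARMAP.get(token, []))
    if not chars:
        # Assumption: a token with no entry in the map stands for itself.
        chars.append(token)
    return chars


def train_bigram(lines):
    """counts[prev][ch] = number of times ch followed prev in training."""
    counts = defaultdict(Counter)
    for line in lines:
        prev = START
        for ch in line:
            counts[prev][ch] += 1
            prev = ch
    return counts


def predict(tokens, counts):
    """For each token, pick the candidate with the highest bigram count."""
    prev = START
    output = []
    for token in tokens:
        following = counts.get(prev, Counter())
        # An unseen candidate has count 0; ties go to the first candidate.
        best = max(candidates(token), key=lambda ch: following[ch])
        output.append(best)
        prev = best
    return output


# Toy data standing in for chinese/train.han.
counts = train_bigram(["你好", "你好", "尼"])
predicted = predict(["ni", "hao"], counts)
print("".join(predicted))
```

Choosing by raw count is equivalent to choosing by unsmoothed probability here, because all candidates for a token share the same previous character and therefore the same denominator.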


Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file netid-hw1-part.tgz where part is the part (1, 2, or 3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!

Your submission should contain: