In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it helps if the computer can guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Download the Homework 2 files [tgz] [zip]. The archive contains the following files:
english/train | English training data |
english/dev | English development data |
english/test | English test data |
chinese/train.han | Chinese training data |
chinese/dev.pin | Chinese development data: input (pronunciations) |
chinese/dev.han | Chinese development data: correct output (characters) |
chinese/test.pin | Chinese test data: input (pronunciations) |
chinese/test.han | Chinese test data: correct output (characters) |
chinese/charmap | Chinese character/pronunciation map |
keyboard.py | GUI demo |
chinese/charmap is derived from the Unicode Unihan database.
Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.
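To make the difference between characters and bytes concrete, here is a minimal Python 3 sketch. The helper name `read_characters` is made up for illustration; the only thing it relies on is the `encoding` argument of the built-in `open`.

```python
def read_characters(path):
    """Return the file's contents as a str of Unicode characters (not bytes)."""
    # Passing the encoding explicitly avoids depending on the platform default.
    with open(path, encoding="utf-8") as f:
        return f.read()

text = "你好"
print(len(text))                  # 2 (characters)
print(len(text.encode("utf-8")))  # 6 (bytes)
```

If your counts are keyed on bytes rather than characters, each Chinese character will be split into several meaningless symbols, so it's worth checking this early.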
In the following, point values are written after each requirement, like this. (30 points)
Implement a character-based language model using Witten-Bell smoothing, as described in Sections 5.3.2 and 5.5 of the notes (in particular, Equation 5.14). You're welcome to try a different kind of model instead, like Kneser-Ney smoothing. However, Witten-Bell is sufficient for this task. (I really wanted to recommend neural networks as another option, but unfortunately they don't seem to work well under the constraints of this assignment.)
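As a rough illustration, here is a small sketch of a character n-gram model with Witten-Bell smoothing in its usual textbook form: at each context length, the relative-frequency estimate is weighted by λ = c(u) / (c(u) + n(u)), where c(u) is the number of times context u was seen and n(u) is the number of distinct characters that followed it, and the remaining weight goes to the next shorter context, bottoming out in a uniform distribution. The notation in the notes may differ, so check it against Equation 5.14 before relying on it. The class name and the methods `train`, `prob`, and `predict` are assumptions made for this sketch; they are not necessarily the interface of class Uniform in keyboard.py.

```python
from collections import Counter, defaultdict

class WittenBellLM:
    """Character n-gram model with Witten-Bell smoothing (illustrative sketch)."""

    def __init__(self, order, vocab):
        self.order = order                # e.g. 3 conditions on the 2 previous characters
        self.vocab = sorted(set(vocab))   # assumed to cover every training character
        # counts[context][char] = number of times char followed context
        self.counts = defaultdict(Counter)

    def train(self, text):
        for i, ch in enumerate(text):
            # Record ch under every context length from 0 up to order - 1.
            for k in range(min(self.order, i + 1)):
                self.counts[text[i - k:i]][ch] += 1

    def prob(self, context, ch):
        """Smoothed probability of ch given the preceding characters."""
        context = context[-(self.order - 1):] if self.order > 1 else ""
        p = 1.0 / len(self.vocab)         # base case: uniform over the vocabulary
        # Interpolate from the empty context up to the longest available one.
        for k in range(len(context) + 1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                break                     # longer contexts are unseen as well
            total = sum(seen.values())    # c(u)
            types = len(seen)             # n(u): distinct characters after u
            lam = total / (total + types)
            p = lam * seen[ch] / total + (1 - lam) * p
        return p

    def predict(self, context):
        """Most probable next character after context."""
        return max(self.vocab, key=lambda c: self.prob(context, c))

lm = WittenBellLM(order=3, vocab="abcdr")
lm.train("abracadabra")
print(lm.predict("ab"))
```

Recomputing `sum(seen.values())` on every call keeps the sketch short but is slow; a real implementation would probably cache the per-context totals.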
It's recommended, but not required, that your language model have the same interface as class Uniform in keyboard.py. (If it does, then, for fun, you can try plugging your language model into that script, which shows a keyboard whose keys grow and shrink depending on their current probability.)
In any case, your code should be able to:
english/train). (3 points)
Briefly describe what kind of language model you implemented, along with any relevant implementation details. (4 points)
In this part, you'll evaluate your language model on English, by reading in text (simulating what a user might type) and predicting what each character would be.
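One way to picture this evaluation loop is the sketch below, which assumes that accuracy (the fraction of characters guessed correctly) is the quantity of interest. The function name `prediction_accuracy` and the `predict` callback, which takes the text typed so far and returns a single guessed character, are hypothetical names chosen for this example.

```python
def prediction_accuracy(predict, text):
    """Fraction of positions where predict(history) matches the character actually typed."""
    if not text:
        return 0.0
    correct = 0
    for i, actual in enumerate(text):
        # The predictor only ever sees the characters before position i.
        if predict(text[:i]) == actual:
            correct += 1
    return correct / len(text)

# A trivial baseline that always guesses a space.
always_space = lambda history: " "
print(prediction_accuracy(always_space, "a b c"))  # 0.4
```

Slicing `text[:i]` at every position is quadratic in the length of the text; for a large file you would more likely pass only the last few characters that your model actually conditions on.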
(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users type Chinese on a standard QWERTY keyboard by entering the pronunciation of each character in Pinyin. What made this possible is the advent of statistical models that automatically guess what the right character is.
For this part, you will write a program that reads in pronunciations (simulating what a user might type) and predicts what the correct Chinese characters are.
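The overall shape of such a program might look like the sketch below. It assumes a character map with one character and one pronunciation per line, an input that is a sequence of whitespace-separated tokens, and a scoring function `prob(history, char)` supplied by your language model. All function names here are hypothetical, and the rule that a single-character token may also stand for itself is an assumption of this sketch.

```python
from collections import defaultdict

def load_charmap(lines):
    """Map each pronunciation to the set of characters that can have it."""
    charmap = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                      # skip blank or malformed lines
        char, pron = fields
        charmap[pron].add(char)
    return charmap

def candidates(token, charmap):
    """Characters that a single typed token could stand for."""
    if token == "<space>":
        return {" "}
    cands = set(charmap.get(token, ()))
    if len(token) == 1:
        cands.add(token)                  # assumption: a lone character can be itself
    return cands or {token}               # assumption: unknown tokens pass through

def predict_characters(tokens, gold, charmap, prob):
    """Guess one character per token, conditioning on the correct previous characters."""
    predictions = []
    for i, token in enumerate(tokens):
        history = gold[:i]                # correct characters so far, not our guesses
        options = sorted(candidates(token, charmap))
        predictions.append(max(options, key=lambda c: prob(history, c)))
    return "".join(predictions)

charmap = load_charmap(["你 ni", "尼 ni", "在 zai"])
toy_prob = lambda history, c: 0.9 if c in "你在" else 0.1   # stand-in for a real model
tokens = "o , <space> ni zai".split()
print(predict_characters(tokens, "o, 你在", charmap, toy_prob))  # o, 你在
```

Because each guess is conditioned on the correct history rather than on earlier guesses, one wrong prediction does not affect the ones after it, and the output can be compared with the reference character by character.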
You'll be given four files:
chinese/train.han, which you will train your model on.
chinese/charmap, in the following format:
㐀 qiu
㐁 tian
㐄 kua
Each line has exactly two whitespace-separated columns. The first column is the Chinese character, and the second column is the pronunciation.
chinese/dev.pin or chinese/test.pin, in the following format:
o f a n , <space> ni zai yong shen me v p s ya ?
(which means, "ofan, what vps [virtual private server] are you using?")
This is a simulation of what a user might type. For each whitespace-separated token, you have to guess what character the user meant to type (given the correct previous characters). Each token could be:
a token such as ni, which can convert to one of the Chinese characters listed in the character map,
a token such as o, which can convert to itself, or
<space>, which converts to a space.
chinese/dev.han or chinese/test.han, which you will compare your predictions against.
yi. (2 points)
Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. Please name your file hw2-netid.tgz.