In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Clone the Homework 1 repository. It contains the following files:
|English training data|
|English development data|
|English test data|
|Chinese training data|
|Chinese development data: inputs|
|Chinese development data: correct outputs|
|Chinese test data: input (pronunciations)|
|Chinese test data: correct output (characters)|
|Chinese character/pronunciation map|
|Unigram language model|
chinese/charmap is derived from the Unicode Unihan database.
Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.
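As a minimal sketch of reading a UTF-8 file as characters rather than bytes in Python 3: passing encoding="utf-8" explicitly avoids depending on the platform's default encoding. The helper name read_lines and the throwaway demo file are assumptions for illustration, not part of the repository.

```python
import os
import tempfile

def read_lines(filename):
    """Return the lines of a UTF-8 text file as str objects, without trailing newlines."""
    with open(filename, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Demonstration on a throwaway file, so the sketch runs without the repository.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("你在用什么\n")
    demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)

# Each element of a str is one Unicode character, not one byte.
n_chars = len(demo_lines[0])
n_bytes = len(demo_lines[0].encode("utf-8"))
```

Iterating over a str obtained this way visits one character at a time, which is what a character language model needs.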
In the following, point values are written after each requirement, like this. [30]
The language models that you implement should have the same interface as the Unigram class in keyboard.py. They should support the following operations:
train(filename): Train the model on the text file named filename.
start(): Forget about all previously-read characters.
read(c): Read in character c.
prob(c): Return the probability of c given all previously-read characters.
keyboard.py, which shows a keyboard whose keys grow and shrink depending on their current probability.)
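The train/start/read/prob interface described above can be sketched as a small class. This is a hypothetical stand-in, not the repository's Unigram class: the name CountModel, the decision to skip line breaks, and the lack of smoothing are all assumptions made for illustration.

```python
import collections
import os
import tempfile

class CountModel:
    """Hypothetical stand-in with the train/start/read/prob interface.

    It ignores context and returns relative frequencies; it is not the
    repository's Unigram class.
    """

    def __init__(self):
        self.counts = collections.Counter()
        self.total = 0

    def train(self, filename):
        # Assumption: the file is UTF-8 and line breaks are not predicted.
        with open(filename, encoding="utf-8") as f:
            for line in f:
                self.counts.update(line.rstrip("\n"))
        self.total = sum(self.counts.values())

    def start(self):
        # A context-free model keeps no history, so there is nothing to reset.
        pass

    def read(self, c):
        # A context-sensitive model would update its state with c here.
        pass

    def prob(self, c):
        # Unseen characters get probability 0 in this unsmoothed sketch.
        return self.counts[c] / self.total if self.total else 0.0

# Usage on a throwaway training file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.write("aab\nab\n")
    path = tmp.name
m = CountModel()
m.train(path)
os.remove(path)
m.start()
m.read("a")
p_a = m.prob("a")
```

A model that actually uses context would store the characters passed to read() and condition prob() on them; the method signatures would stay the same.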
Write a program that trains the Unigram model on the training data (english/train) and, for each character position in another file (english/test), predicts the most probable character according to the model, given all previous correct characters. [5] (Since the model is already implemented for you, there's not much to do here.) Report the accuracy (what percent of the predictions are correct) on the development set. It should be about 15%. [1]
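One way to organize the prediction loop and the accuracy computation is sketched below. Several things here are assumptions rather than part of the assignment: the stub FixedModel, the explicit alphabet argument (the interface itself exposes no list of known characters), and resetting the model at the start of each line.

```python
class FixedModel:
    """Hypothetical stub with the same interface; always prefers 'e'."""
    def start(self):
        pass
    def read(self, c):
        pass
    def prob(self, c):
        return {"e": 0.5, "t": 0.3, "a": 0.2}.get(c, 0.0)

def accuracy(model, lines, alphabet):
    """Fraction of positions where the model's top guess equals the true character."""
    correct = total = 0
    for line in lines:
        model.start()  # assumption: history is reset at the start of each line
        for c in line:
            guess = max(alphabet, key=model.prob)
            correct += (guess == c)
            total += 1
            model.read(c)  # feed the correct character, not the guess
    return correct / total if total else 0.0

acc = accuracy(FixedModel(), ["tea", "eel"], alphabet=["a", "e", "t"])
```

Note that the model always reads the correct character after each guess, which matches the requirement to predict given all previous correct characters.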
In this part, you'll try to improve the quality of your character predictor.
(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users type Chinese using a standard QWERTY keyboard to type the pronunciation of each character in Pinyin. This became possible with the advent of statistical models that automatically guess which character the user intended.
For this part, you will write a program that reads in pronunciations (simulating what a user might type) and predicts the correct Chinese characters.
You are given four files:
chinese/train.han: text written in Chinese characters, which you will train your language model on.
chinese/charmap: a character/pronunciation map in the following format:
    㐀 qiu
    㐁 tian
    㐄 kua
Each line has exactly two whitespace-separated columns. The first column is the Chinese character, and the second column is the pronunciation.
chinese/test.pin: test input (pronunciations) in the following format:
    o f a n , <space> ni zai yong shen me v p s ya ?
(which means, "ofan, what vps [virtual private server] are you using?") This is a simulation of what a user might type. For each whitespace-separated input token, you have to guess what character the user meant to type (given the correct previous characters). Each input token could be one of:
(i) A pronunciation (e.g., ni): For every line in the character map whose second field matches the pronunciation, the first field gives a possible output character.
(ii) A Latin character (e.g., o): In this case, the Latin character itself is also a possible output character.
(iii) <space>: In this case, the output character is always a space.
chinese/test.han: the correct output (characters), which you will compare your predictions against.
Write a function candidates(token) that, given an input token token, returns a list of possible characters that the token could specify. It should be the union of the characters in categories (i), (ii), and (iii) above. [3]
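A minimal sketch of building the pronunciation lookup and the candidate list follows. It makes several assumptions that the assignment does not fix: the helper load_charmap is hypothetical, candidates takes the lookup table as a second argument instead of reading a global, and category (ii) is approximated as "any single-character token".

```python
import collections

def load_charmap(lines):
    """Map each pronunciation to the set of characters that can have it."""
    by_pron = collections.defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            char, pron = fields
            by_pron[pron].add(char)
    return by_pron

def candidates(token, by_pron):
    """Union of categories (i), (ii), and (iii) for one input token."""
    if token == "<space>":
        return [" "]                      # (iii) always a space
    chars = set(by_pron.get(token, ()))   # (i) characters with this pronunciation
    if len(token) == 1:
        chars.add(token)                  # (ii) the typed character itself
    return sorted(chars)

# Three sample lines in the character map format.
charmap = load_charmap(["㐀 qiu", "㐁 tian", "㐄 kua"])
qiu_options = candidates("qiu", charmap)
```

Indexing the map by pronunciation once, up front, avoids rescanning the whole character map for every token.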
Write a program that reads in chinese/test.pin and, for each token token, uses the language model to predict the character from candidates(token) that has the highest probability. [2]
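The per-token prediction step might look like the sketch below. The stub FixedModel, the toy candidate table, the helper name predict_line, and the fallback for tokens with no candidates are all assumptions for illustration; it also assumes that the correct output has exactly one character per input token.

```python
class FixedModel:
    """Hypothetical stub language model with the start/read/prob interface."""
    def start(self):
        pass
    def read(self, c):
        pass
    def prob(self, c):
        return {"你": 0.4, "尼": 0.1, "在": 0.3, "再": 0.2}.get(c, 0.01)

def predict_line(model, tokens, correct, candidates):
    """Guess one character per token, conditioning on the correct history."""
    guesses = []
    model.start()
    for token, truth in zip(tokens, correct):
        options = candidates(token)
        # Fall back to the raw token if nothing is known (an assumption of this sketch).
        guess = max(options, key=model.prob) if options else token
        guesses.append(guess)
        model.read(truth)  # feed the correct character, not the guess
    return guesses

# Toy candidate table standing in for the character map.
toy = {"ni": ["你", "尼"], "zai": ["在", "再"], "<space>": [" "]}
guesses = predict_line(FixedModel(), "ni <space> zai".split(), "你 在",
                       lambda t: toy.get(t, [t]))
```

Comparing the returned guesses against the correct characters position by position gives the accuracy in the same way as for the English data.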
Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file
part is the part (
3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!
Your submission should contain: