In situations where text input is slow (mobile phones, Chinese/Japanese characters, users with disabilities), it can be helpful for the computer to be able to guess the next character(s) the user will type. In this assignment, you'll build a character language model and test how well it can predict the next character.
Clone the Homework 1 repository. It contains the following files:
| File | Description |
| --- | --- |
| english/train | English training data |
| english/dev | English development data |
| english/test | English test data |
| chinese/train.han | Chinese training data |
| chinese/dev.pin | Chinese development data: inputs (pronunciations) |
| chinese/dev.han | Chinese development data: correct outputs (characters) |
| chinese/test.pin | Chinese test data: inputs (pronunciations) |
| chinese/test.han | Chinese test data: correct outputs (characters) |
| chinese/charmap | Chinese character/pronunciation map |
| keyboard.py | GUI demo |
| unigram.py | Unigram language model |
chinese/charmap is derived from the Unicode Unihan database.
Note: the Chinese files are in UTF-8. If you're using Python 3, you won't have to worry about this, but if you're using something else (like Python 2), be sure that you are dealing with Unicode characters, not bytes.
In the following, point values are written after each requirement, like this. [30]
The language models that you implement should have the same interface as the Unigram class in unigram.py. They should support the following operations:
train(filename): Train the model on the text file named filename (e.g., english/train).
start(): Forget about all previously-read characters.
read(c): Read in character c.
prob(c): Return the probability of c given all previously-read characters.
(The GUI demo, keyboard.py, shows a keyboard whose keys grow and shrink depending on their current probability.)
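To make the four operations concrete, here is a minimal sketch of a class with that interface. It is not the provided Unigram class: the name UnigramSketch, the choice to drop line endings, and the unsmoothed relative-frequency estimate are all assumptions made for illustration.

```python
import collections


class UnigramSketch:
    """Hypothetical stand-in exposing train/start/read/prob (not the provided class)."""

    def __init__(self):
        self.counts = collections.Counter()
        self.total = 0

    def train(self, filename):
        # Count every character in the file; line endings are dropped (an assumption).
        with open(filename, encoding="utf-8") as f:
            for line in f:
                for c in line.rstrip("\n"):
                    self.counts[c] += 1
                    self.total += 1

    def start(self):
        # A unigram model keeps no history, so there is nothing to forget.
        pass

    def read(self, c):
        # A unigram model ignores context, so reading a character changes nothing.
        pass

    def prob(self, c):
        # Relative frequency of c in the training data; 0.0 if unseen or untrained.
        if self.total == 0:
            return 0.0
        return self.counts[c] / self.total
```

A model that conditions on previous characters would keep its history in start() and read(); here both are no-ops because a unigram probability does not depend on context.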
Write a program that trains the Unigram model on the training data (english/train) and, for each character position in another file (english/dev or english/test), predicts the most probable character according to the model, given all previous correct characters. [5] (Since the model is already implemented for you, there's not much to do here.) Report the accuracy (what percent of the predictions are correct) on the development set. It should be about 15%. [1]
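The prediction loop described above might be sketched as follows. This is only one possible shape, under several assumptions: the helper names predict and accuracy, the explicit alphabet argument, and calling start() once per line are not specified by the assignment, and FixedModel is a toy stand-in used only so the sketch runs on its own.

```python
class FixedModel:
    """Toy stand-in with fixed, context-free probabilities (hypothetical)."""

    def __init__(self, probs):
        self.probs = probs

    def start(self):
        pass

    def read(self, c):
        pass

    def prob(self, c):
        return self.probs.get(c, 0.0)


def predict(model, alphabet):
    """Return the character in alphabet that the model currently rates most probable."""
    return max(alphabet, key=model.prob)


def accuracy(model, lines, alphabet):
    """Fraction of character positions where the model's top guess is correct."""
    correct = total = 0
    for line in lines:
        model.start()  # assumption: history is reset at the start of each line
        for c in line:
            if predict(model, alphabet) == c:
                correct += 1
            total += 1
            model.read(c)  # feed the correct character, not the guess
    return correct / total if total else 0.0
```

Note that read() is always given the true character, which matches the requirement that each prediction be conditioned on the correct previous characters rather than on earlier guesses.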
In this part, you'll try to improve the quality of your character predictor.
(Mandarin) Chinese is written using a rather large set of characters (3,000–4,000), which presents a challenge for typing. Most younger users enter Chinese by typing the pronunciation of each character in Pinyin on a standard QWERTY keyboard. This became possible with the advent of statistical models that automatically guess what the right character is.
For this part, you will write a program that reads in pronunciations (simulating what a user might type) and predicts the correct Chinese characters.
You are given four files:
Training data (chinese/train.han), written in Chinese characters, which you will train your language model on.
A character map (chinese/charmap) in the following format:
㐀 qiu
㐁 tian
㐄 kua
Each line has exactly two whitespace-separated columns. The first column is the Chinese character, and the second column is the pronunciation.
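Since lookups go from a pronunciation to its possible characters, it can help to index the map by the second column. The following is a minimal sketch; the function name load_charmap and the decision to skip lines that do not have exactly two fields are assumptions, not part of the assignment.

```python
import collections


def load_charmap(lines):
    """Map each pronunciation to the set of characters listed with it."""
    by_pron = collections.defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines (an assumption)
        char, pron = fields
        by_pron[pron].add(char)
    return by_pron
```

Because the function accepts any iterable of lines, it can be given an open file directly, for example `with open("chinese/charmap", encoding="utf-8") as f: by_pron = load_charmap(f)`.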
Input data (chinese/dev.pin or chinese/test.pin) in the following format:
o f a n , <space> ni zai yong shen me v p s ya ?
(which means, "ofan, what vps [virtual private server] are you using?")
This is a simulation of what a user might type. For each whitespace-separated input token, you have to guess what character the user meant to type (given the correct previous characters). Each input token could be one of:
(i) A pronunciation (e.g., ni): For every line in the character map whose second field matches the pronunciation, the first field gives a possible output character.
(ii) A single character (e.g., o): In this case, the Latin character itself is also a possible output character.
(iii) <space>: In this case, the output character is always a space.
Correct outputs (chinese/dev.han or chinese/test.han), which you will compare your predictions against.
Write a function candidates(token) that, given an input token token, returns a list of possible characters that the token could specify. It should be the union of the characters in categories (i), (ii), and (iii) above. [3]
Write a program that reads chinese/dev.pin or chinese/test.pin and, for each token token, uses the language model to predict the character from candidates(token) that has the highest probability. [2]
Please submit all of the following in a gzipped tar archive (.tgz; not .zip or .rar) via Sakai. If you're making a full submission, please name your file netid-hw1.tgz. If you're making a partial submission, please name your file netid-hw1-part.tgz, where part is the part (1, 2, or 3) that you're submitting. Note that submitting two files with the same name will overwrite one of them!
Your submission should contain: