CSE 40/60657: Natural Language Processing

Homework 3

Due
2014/03/26 11:55 pm
Points
30

In this assignment, you will build a model that automatically learns the relationship between closely related languages.

Note: You have more time to do this assignment but it is also the hardest assignment of the semester, so try to get an early start.

Download the Homework 3 data from Sakai. It contains the following files:
hamlet.txt Hamlet in original and modern spelling
hamlet.unspaced.txt
sr-hr.txt Serbian-Croatian data
sr-hr.unspaced.txt

The file hamlet.txt contains the text of Hamlet from Shakespeare's First Folio in original and modern English spelling.

The file sr-hr.txt contains strings in Serbian and Croatian, which are extremely closely related but written in Cyrillic and Latin script, respectively. The strings come from the Serbian and Croatian localizations of the GNOME project.

Each line of both files looks like this:

О _ н е ! ||| O _ n e !
There are two strings, separated by triple vertical bars (|||). The two strings are in original-spelling and modern-spelling English, respectively, or Serbian and Croatian, respectively. Within each string, the characters are separated by whitespace, and the space character is represented by an underscore (_). The reason we've represented the data in this odd way is that if you don't want to mess with Unicode, you don't have to—just split on whitespace. (If you prefer to work with Unicode, you can use the unspaced files instead.)

You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting.

In the following, point values are written after each requirement, like this.30

  1. To be or not to be
    1. Write code to read in hamlet.txt and determine what letters are in the original and modern alphabets. How many character types are there in each?1
    2. The model has parameters $p(b \mid a)$, where $a$ or $b$ can be $\epsilon$. Write code to initialize the model to emulate classical string edit distance: $p(a \mid a) = 1$ for all $a$ and $p(b \mid a) = \tfrac12$ for all $a \neq b$. (Nothing to report.)
    3. Write code to construct, for each line, a finite-state transducer that maps the original string to the modern string using insertion, deletion, and substitution operations, as shown in the notes, and as shown below in a more code-friendly way. Let $w$ be the original string and $x$ be the modern string. There is a state $(i,j,\mathit{flag})$ for $i \in \{0, \ldots, |w|\}$ and $j \in \{0, \ldots, |x|\}$ and $\mathit{flag} \in \{\mathit{false}, \mathit{true}\}$; and there is a final state $q_f$. The initial state is $(0, 0, \mathit{false})$. The transitions are:
      • Insertion: $(i,j-1,\mathit{false}) \xrightarrow{\epsilon : x_{j-1} \thinspace/\thinspace p(x_{j-1} \mid \epsilon)} (i,j,\mathit{false})$ if $j > 0$
      • Deletion: $(i-1,j,\mathit{true}) \xrightarrow{w_{i-1} : \epsilon \thinspace/\thinspace p(\epsilon \mid w_{i-1})} (i,j,\mathit{false})$ if $i > 0$
      • Substitution: $(i-1,j-1,\mathit{true}) \xrightarrow{w_{i-1} : x_{j-1} \thinspace/\thinspace p(x_{j-1} \mid w_{i-1})} (i,j,\mathit{false})$ if $i > 0$ and $j > 0$
      • Stop insertion: $(i,j,\mathit{false}) \xrightarrow{\epsilon : \epsilon \thinspace/\thinspace p(\epsilon \mid \epsilon)} (i,j,\mathit{true})$
      • All stop: $(|w|, |x|, \mathit{false}) \xrightarrow{</s>:</s> \thinspace/\thinspace 1} q_f$
      As in Homework 2, you'll need to index your FST data structure so that (i) you can loop through the states in topological order, and (ii) for each state, you can access all incoming transitions. (Nothing to report.)
    4. Implement the Viterbi algorithm to find the best path through the FST you constructed. You may reuse any code from Homework 2. For the first ten lines, show each best path in the following format:3 Also show the weight of each best path.3
      [Francisco] Nay  answer me: Stand &   vnfold
      [Francisco] Nay, answer me. Stand and unfold
      
      your selfe.
      your self .
      
      (Your output might not exactly match the above, but it should be close.) Briefly describe whether you think the model works well, and why.3
  2. напред назад
    1. Now switch to sr-hr.txt. Try running the Viterbi algorithm on the first ten lines of this data and show the output.1 You should see that it does fine on many lines, but there are problems. For example, line 3 probably looks like this, which is not right:
      Руковање 
      Rukovanje
      
      Briefly explain what's wrong with this, and why.3 Hint: looks can be deceiving.
    2. Write code to initialize the model uniformly: $p(b \mid a) = \frac1{m+1}$, where $m$ is the size of the Croatian alphabet, for all $a$ and $b$ including $\epsilon$. (Nothing to report.)
    3. Implement the forward algorithm. You will need to implement addition of log-probabilities, as described on chapter 1, page 6 of the notes. Show the forward log-probability of $q_f$ for the first ten lines.2
    4. Implement the backward algorithm. Show the backward log-probability of the initial state for the first ten lines.1 Each should be equal to the forward probability of the final state.3 If not, debug before you move on.
  3. скоро готов
    1. Write code for the E step. For each line, for each edge $e$ labeled $a:b$, increment the fractional count $c(a,b)$ by $\gamma(e)$ as described in the notes. Show all the $c(a,b)$ for the first line for the first iteration of EM.1
    2. Write code for the M step, that is, to reestimate the model from the fractional counts. Show all the $p(b \mid a)$ for the first line for the first iteration of EM.1
    3. Run the EM algorithm on the data and show the total log-likelihood for each iteration, which should always go up.3
    4. After running EM, for the first ten lines, show the Serbian-Croatian alignment that your model finds.1 For each Serbian letter $a$, show $p(b\mid a)$ for every Croatian letter $b$ such that $p(b \mid a) \geq 0.1$.1 Briefly describe what your model learned.3

Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: