# CSE 40657/60657 Homework 5

Due: Mon 2021/05/10 5pm

Points: 30

In this assignment you will build a simple system for answering questions based on passages from Wikipedia.

Whenever the instructions below say to "report" something, it should be reported in the README.md file that you submit.

## 1. QA as MT

1. Visit this GitHub Classroom link to create a Git repository for you, and clone it to your computer. Initially, it contains the following files:

   | File | Description |
   | --- | --- |
   | minitrain.txt | Small training data |
   | train.txt | Training data |
   | dev.txt | Development data |
   | eval.py | Evaluation script |

Each line of the data files has three tab-separated fields: a context, a question, and a set of answers, like this (with tabs written as \t):
architec@@ turally , the school has a catholic character . at@@ op the
main building 's gold d@@ ome is a golden statue of the virgin mary
. immediately in front of the main building and facing it , is a
copper statue of christ with arms up@@ raised with the legend " ven@@
ite ad me om@@ nes " . next to the main building is the basilica of
the sacred heart . immediately behind the basilica is the gro@@ t@@ to
, a mar@@ ian place of prayer and refl@@ ection . it is a repl@@ ica
of the gro@@ t@@ to at l@@ our@@ des , france where the virgin mary
repu@@ tedly appeared to saint ber@@ na@@ de@@ tte sou@@ bi@@ rou@@ s
in 185@@ 8 . at the end of the main drive ( and in a direct line that
connects through 3 stat@@ ues and the gold d@@ ome ) , is a simple ,
modern stone statue of mary . \t to whom did the virgin mary alle@@ ge@@
dly appear in 185@@ 8 in l@@ our@@ des france ? \t 118-127

The context is a passage from a Wikipedia article and the question is guaranteed to have an answer that can be found in the passage. The answers are space-separated ranges of integers. Here, there's just one answer, 118-127, which means that the answer starts at token 118 using zero-based indexing and ends before token 127, again using zero-based indexing. (In other words, it's just like a Python slice 118:127.)
2. Write code to read a file in this format. (3 points) For each line, construct a "source" string containing both the question and the context:
<BOS> to whom did the virgin mary ... <SEP> architec@@ turally , the school has a catholic character ... <EOS>
And construct a set of "target" strings, one for each answer:
<BOS> saint ber@@ na@@ de@@ tte sou@@ bi@@ rou@@ s <EOS>
3. Using transformer.py from the HW2 solution, write code to train a model to "translate" questions+contexts into answers. (2 points) If there is more than one answer, repeat the question+context for each answer.
4. Write code to answer the questions in the development set. (2 points) For each question+context, your system should attempt to answer the question once.
5. For this part and part 2, please use minitrain.txt as both the training and development set, which the model ought to be able to learn (near-)perfectly. After about 50 epochs, the perplexity (ppl) on the development set should be less than 1.2. (1 point)
6. Please report your system's answers to the first ten questions in minitrain.txt. (1 point)
7. Use eval.py to measure the F1 score of your answers, and report the score, which should be at least 95%. (1 point)
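The reading step in item 2 can be sketched as follows. This is a minimal illustration, not the reference solution: the function name `read_examples` and the choice to return targets as a list are assumptions made here, and the special tokens are simply the ones shown in the examples above.

```python
def read_examples(lines):
    """Parse tab-separated lines into (source, targets) pairs.

    Hypothetical helper: each line holds a context, a question, and
    space-separated answer ranges such as "118-127".
    """
    examples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        context, question, answers = line.split("\t")
        context_toks = context.split()
        question_toks = question.split()
        # Source: question first, then the context, as in the example above.
        source = " ".join(
            ["<BOS>"] + question_toks + ["<SEP>"] + context_toks + ["<EOS>"]
        )
        targets = []
        for span in answers.split():
            start, end = map(int, span.split("-"))
            # Ranges behave like Python slices: start inclusive, end exclusive.
            answer_toks = context_toks[start:end]
            targets.append(" ".join(["<BOS>"] + answer_toks + ["<EOS>"]))
        examples.append((source, targets))
    return examples


demo = "the cat sat on the mat .\twhere did the cat sit ?\t3-6\n"
source, targets = read_examples([demo])[0]
print(source)
print(targets)
```

For training, each `(source, target)` pair would be emitted separately when a line has several answers, as item 3 requires.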

## 2. Span-based QA

1. Leave the encoder unchanged, but remove the decoder from the model and replace it with a layer that directly predicts the starting and ending word of the answer (5 points):
   \begin{align}
   \mathbf{w}_\text{start}, \mathbf{w}_\text{end} &\in \mathbb{R}^{d} \\
   \mathbf{p}_\text{start}, \mathbf{p}_\text{end} &\in \mathbb{R}^n \\
   \mathbf{p}_{\text{start}} &= \operatorname{softmax} \mathbf{H} \mathbf{w}_\text{start} \\
   \mathbf{p}_{\text{end}} &= \operatorname{softmax} \mathbf{H} \mathbf{w}_\text{end}
   \end{align}
   For these products to be defined, $\mathbf{H}$ must be an $n \times d$ matrix; it is presumably the encoder output, with one row for each of the $n$ source tokens and $d$ the model dimension.
2. Modify the training code so that, for each question and answer in the training data, if $i$ and $j$ are the indices of the first and last word of the answer, respectively, it minimizes the loss function (5 points):
   $$L = -(\log [\mathbf{p}_\text{start}]_{i} + \log [\mathbf{p}_\text{end}]_{j}).$$
3. Write code to guess the best answer for each question in the development data. Replace the Model.translate function with a new function that guesses the $i$ and $j$ (such that $i \leq j$) that minimize $L$ above. (2 points)
4. After about 50 epochs, the dev loss should be below 50 (1 point) and the F1 score should reach at least 95%. (1 point) We've observed quite a bit of variance in the dev loss, so you might try running a few times; there is no need to report the mean and standard deviation, though you are welcome to.
5. Please report your system's answers to the first ten questions in minitrain.txt. (1 point)
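The loss and the constrained search for $i \leq j$ can be sketched in plain Python. The function names below are hypothetical, and the scores stand in for the entries of $\mathbf{H}\mathbf{w}_\text{start}$ and $\mathbf{H}\mathbf{w}_\text{end}$. The index conversion in `source_span` assumes that the encoder input keeps the Part 1 layout (`<BOS>`, question, `<SEP>`, context, `<EOS>`); if your input is laid out differently, the offset changes.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def source_span(start, end, question_len):
    """Map a context slice start:end to inclusive (i, j) source indices.

    Assumption: the source is <BOS> question <SEP> context <EOS>.
    """
    offset = 1 + question_len + 1  # <BOS>, question tokens, <SEP>
    return start + offset, end - 1 + offset


def span_loss(start_scores, end_scores, i, j):
    """L = -(log p_start[i] + log p_end[j])."""
    return -(log_softmax(start_scores)[i] + log_softmax(end_scores)[j])


def best_span(start_scores, end_scores):
    """Return the (i, j) with i <= j that minimizes L, in one pass.

    Minimizing L is the same as maximizing log p_start[i] + log p_end[j],
    so for each j it suffices to remember the best start seen so far.
    """
    log_p_start = log_softmax(start_scores)
    log_p_end = log_softmax(end_scores)
    best_total, best_pair = -math.inf, (0, 0)
    best_i = 0
    for j in range(len(log_p_end)):
        if log_p_start[j] > log_p_start[best_i]:
            best_i = j
        total = log_p_start[best_i] + log_p_end[j]
        if total > best_total:
            best_total, best_pair = total, (best_i, j)
    return best_pair


start_scores = [0.0, 2.0, 0.5, 3.0]
end_scores = [1.0, 0.0, 2.5, 0.5]
print(best_span(start_scores, end_scores))
```

In this toy example, taking the two argmaxes independently would give a start at position 3 and an end at position 2, which is not a valid span; the running-maximum search avoids that without checking every pair.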

## 3. Full training data

1. Once the system seems to be working, train on the full training data (train.txt) and answer the questions in the development data (dev.txt), which is just the first 1000 lines of the full development data.
2. Expect each epoch to take roughly 30 minutes. After about 5 epochs, the development loss should be less than 14000 (1 point) and F1 should reach at least 15%. (1 point) (This method would do a lot better with pretraining.)
3. Please report your system's answers to the first ten questions in dev.txt. (1 point)
4. Please have a look through the rest of the outputs and write down a few observations about what the system can and can't answer. (2 points)
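When looking through the outputs, it may help to score each answer individually so that the best and worst cases can be sorted to the top. The sketch below uses a common token-overlap F1; it is not taken from eval.py, whose exact definition may differ (for example in how it treats subword markers or multiple reference answers), so use eval.py for the numbers you report. The function names are hypothetical.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-overlap F1 between two space-separated strings."""
    pred_toks = predicted.split()
    ref_toks = reference.split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both strings.
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)


def best_f1(predicted, references):
    """Score against the closest of several reference answers."""
    return max(token_f1(predicted, r) for r in references)


score = token_f1("saint ber@@ na@@", "saint ber@@ na@@ de@@ tte")
print(score)
```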

2. After you complete each part, create a commit and tag it with git tag -a part1, git tag -a part2, etc. If you make the final submission late, we'll use these tags to compute the per-part late penalty. (You can also create the tags after the fact, with git tag -a part1 abc123, where abc123 is the commit's checksum.)
3. Push your repository and its tags to GitHub (git push --tags origin HEAD).