CSE 40/60657: Natural Language Processing

Homework 1

Due
2014/01/27 11:55 pm
Points
30

In this assignment, you will build various text classification models and experiment with them on a corpus of positive and negative movie reviews (Pang and Lee, sentence polarity dataset v1.0).

Download the Homework 1 data from Sakai. It contains the following files:
train_pos positive training sentences
train_neg negative training sentences
test_pos positive test sentences
test_neg negative test sentences
train_pos and train_neg together form the training set; test_pos and test_neg together form the test set. All files are one sentence per line, tokenized and lowercased, so you don't have to do any preprocessing.
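Since the files are already tokenized and lowercased, reading them and collecting the counts needed later reduces to a few lines. As one possible sketch in Python (the function names here are just illustrative, not required by the assignment):

```python
from collections import Counter

def read_sentences(path):
    """Read a file with one tokenized, lowercased sentence per line."""
    with open(path) as f:
        return [line.split() for line in f]

def collect_counts(data):
    """Collect c(k) and c(k, w) from a list of (class, sentence) pairs."""
    c_k = Counter()   # c(k): number of documents in class k
    c_kw = Counter()  # c(k, w): occurrences of word w in class k
    for k, words in data:
        c_k[k] += 1
        for w in words:
            c_kw[k, w] += 1
    return c_k, c_kw
```

You would call `read_sentences` once per file and label each sentence with its class before counting.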

You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting.

In the following, the point value of each requirement is written in superscript after it, like this.30

  1. Implement a naïve Bayes classifier.
    1. Write code to read in the training sentences and collect counts $c(k)$ (number of documents in class $k$) and $c(k, w)$ (number of occurrences of word $w$ in documents of class $k$) for $k \in \{\mathord+, \mathord-\}$ and all words $w$. Report $c(k)$ and $c(k, w)$ for $k \in \{\mathord+, \mathord-\}$ and $w \in \{\text{movie}, \text{film}\}$.1
    2. Write code to compute the probabilities $p(k)$ and $p(w \mid k)$ for all $k,w$. Train on the training set. Report $p(k)$ and $p(w \mid k)$ for $k \in \{\mathord+, \mathord-\}$ and $w \in \{\text{movie}, \text{film}\}$.1
    3. Write code to read in a test sentence and compute the probabilities $P(\mathord+ \mid d)$ and $P(\mathord- \mid d)$. Report these two probabilities for the first line of train_pos and the first line of train_neg.1
    4. Run the classifier on the test set and report your accuracy,2 which should be at least 75%.3
    5. Describe any implementation choices you had to make (e.g., smoothing, log-probabilities).2
  2. Implement logistic regression.
    1. Write the code for the classifier and its trainer. Report what weight updates you would make (starting from all-zero weights) on the first line of train_pos.1 Then run the trainer on the whole training set, reporting your accuracy on the training set at each iteration.2
    2. Run the classifier on the test set and report your accuracy,2 which should be at least 75%.3
    3. Describe any implementation choices you had to make (e.g., random shuffling of examples, learning rate, number of iterations, weight averaging).2
  3. Experiment with (at least) two new kinds of features.
    1. Extend your bag-of-words feature construction to include an additional kind of feature besides single words. You can try bigrams, prefixes, parts of speech, anything you like. Describe your new features.2 Report your new accuracy for both naïve Bayes1 and logistic regression.1 Briefly write down your conclusions from this experiment.1
    2. Do the same thing for another kind of feature.5
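To make the shape of part 1 concrete, here is one possible sketch of naïve Bayes training and classification in Python, using add-one (Laplace) smoothing and log-probabilities, two of the implementation choices part 1.5 asks you to discuss. The function names and the smoothing parameter `alpha` are illustrative assumptions, not part of the assignment:

```python
import math
from collections import Counter

def train_nb(data, alpha=1.0):
    """Estimate log p(k) and log p(w | k) with add-alpha smoothing.

    data: list of (class, sentence) pairs, each sentence a list of words.
    """
    c_k, c_kw, vocab = Counter(), Counter(), set()
    for k, words in data:
        c_k[k] += 1
        for w in words:
            c_kw[k, w] += 1
            vocab.add(w)
    n_docs = sum(c_k.values())
    log_prior = {k: math.log(c_k[k] / n_docs) for k in c_k}
    total = {k: sum(c for (kk, _), c in c_kw.items() if kk == k) for k in c_k}

    def log_likelihood(k, w):
        # Smoothed log p(w | k); unseen words get probability alpha / (total + alpha*|V|)
        return math.log((c_kw[k, w] + alpha) / (total[k] + alpha * len(vocab)))

    return log_prior, log_likelihood

def classify_nb(log_prior, log_likelihood, words):
    """Return the class maximizing log p(k) + sum_w log p(w | k)."""
    return max(log_prior,
               key=lambda k: log_prior[k] + sum(log_likelihood(k, w) for w in words))
```

Normalizing the two class scores (after exponentiating) gives the posteriors $P(\mathord+ \mid d)$ and $P(\mathord- \mid d)$ asked for in part 1.3.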

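For part 2, one possible sketch of binary logistic regression trained by stochastic gradient ascent is below. It encodes the two classes as +1 and -1 (an arbitrary choice), uses raw bag-of-words counts plus a bias feature, and shuffles the examples each epoch; the learning rate, epoch count, and feature name `**bias**` are all illustrative assumptions:

```python
import math
import random
from collections import Counter

def features(words):
    """Bag-of-words feature vector plus a bias feature."""
    feats = Counter(words)
    feats['**bias**'] = 1.0
    return feats

def train_lr(data, epochs=10, lr=0.1, seed=0):
    """Train binary logistic regression by SGD.

    data: list of (label, words) pairs with label +1 or -1.
    """
    w = Counter()  # feature weights, default 0
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # random shuffling of examples each epoch
        for y, words in data:
            feats = features(words)
            score = sum(w[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-score))  # model's p(+1 | d)
            grad = ((y + 1) / 2) - p            # target in {0, 1} minus p
            for f, v in feats.items():
                w[f] += lr * grad * v
    return w

def classify_lr(w, words):
    """Predict +1 if the linear score is nonnegative, else -1."""
    score = sum(w[f] * v for f, v in features(words).items())
    return +1 if score >= 0 else -1
```

With all-zero weights, $p = 0.5$ for every sentence, so the first update on the first line of train_pos adds $0.5 \cdot \text{lr}$ times each feature value to the corresponding weight, which you can verify by hand for part 2.1.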
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai: