CSE 40/60657: Natural Language Processing
Homework 1
- Due: 2014/01/27 11:55 pm
- Points: 30
In this assignment, you will build various text classification models and experiment with them on a corpus of positive and negative movie reviews (Pang and Lee, sentence polarity dataset v1.0).
Download the Homework 1 data from Sakai. It contains the following files:
| train_pos | positive training sentences |
| train_neg | negative training sentences |
| test_pos | positive test sentences |
| test_neg | negative test sentences |
train_pos and train_neg together form the training set; test_pos and test_neg together form the test set. All files are one sentence per line, tokenized and lowercased, so you don't have to do any preprocessing.
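Since the files are already tokenized, reading them is a matter of splitting on whitespace. A minimal sketch (the helper name `load_sentences` is our own, not part of the assignment):

```python
# Sketch: load a one-sentence-per-line data file into lists of tokens.
# Assumes the file layout described above (tokenized, lowercased).
def load_sentences(path):
    """Return a list of sentences, each a list of lowercase tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

# Example of pairing sentences with labels for training:
# train = [(toks, "+") for toks in load_sentences("train_pos")] + \
#         [(toks, "-") for toks in load_sentences("train_neg")]
```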
You may write code in any language you choose. It should build and run on student*.cse.nd.edu, but if this is not possible, please discuss it with the instructor before submitting.
In the following, point values are written in brackets after each requirement, like this. [30]
- Implement a naïve Bayes classifier.
- Write code to read in the training sentences and collect counts $c(k)$ (the number of documents in class $k$) and $c(k, w)$ (the number of occurrences of word $w$ in documents of class $k$) for $k \in \{\mathord+, \mathord-\}$ and all words $w$. Report $c(k)$ and $c(k, w)$ for $k \in \{\mathord+, \mathord-\}$ and $w \in \{\text{movie}, \text{film}\}$. [1]
- Write code to compute the probabilities $p(k)$ and $p(w \mid k)$ for all $k,w$. Train on the training set. Report $p(k)$ and $p(w \mid k)$ for $k \in \{\mathord+, \mathord-\}$ and $w \in \{\text{movie}, \text{film}\}$. [1]
- Write code to read in a test sentence $d$ and compute the probabilities $P(\mathord+ \mid d)$ and $P(\mathord- \mid d)$. Report these two probabilities for the first line of train_pos and the first line of train_neg. [1]
- Run the classifier on the test set and report your accuracy [2], which should be at least 75%. [3]
- Describe any implementation choices you had to make (e.g., smoothing, log-probabilities). [2]
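To make the pieces above concrete, here is a sketch of the whole pipeline on a toy corpus (not the real data): collecting $c(k)$ and $c(k, w)$, estimating $p(k)$ and $p(w \mid k)$ with add-one smoothing, and classifying in log space. The function names and the smoothing choice are ours, not prescribed by the assignment.

```python
# Naive Bayes sketch: counts, add-one smoothed probabilities, log-space scoring.
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns (p(k), p(w|k), vocab)."""
    ck = Counter()                       # c(k): number of documents per class
    ckw = {}                             # c(k, w): word counts per class
    vocab = set()
    for tokens, k in docs:
        ck[k] += 1
        ckw.setdefault(k, Counter()).update(tokens)
        vocab.update(tokens)
    pk = {k: ck[k] / sum(ck.values()) for k in ck}
    # Add-one smoothing so an unseen word cannot zero out a whole class.
    pwk = {k: {w: (ckw[k][w] + 1) / (sum(ckw[k].values()) + len(vocab))
               for w in vocab} for k in ck}
    return pk, pwk, vocab

def classify_nb(tokens, pk, pwk, vocab):
    """Pick the class maximizing log p(k) + sum of log p(w|k); skip OOV words."""
    def score(k):
        return math.log(pk[k]) + sum(math.log(pwk[k][w])
                                     for w in tokens if w in vocab)
    return max(pk, key=score)

toy = [("a great fun movie".split(), "+"),
       ("a dull bad film".split(), "-")]
pk, pwk, vocab = train_nb(toy)
print(classify_nb("great movie".split(), pk, pwk, vocab))  # → +
```

Working with log-probabilities avoids underflow on long sentences, since products of many small probabilities quickly round to zero in floating point.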
- Implement logistic regression.
- Write the rest of the code for the classifier. Report what weight updates you would make (starting from all-zero weights) on the first line of train_pos. [1] Now run the trainer on the whole training set. At each iteration, report your accuracy on the training set. [2]
- Run the classifier on the test set and report your accuracy [2], which should be at least 75%. [3]
- Describe any implementation choices you had to make (e.g., random shuffling of examples, learning rate, number of iterations, weight averaging). [2]
- Experiment with (at least) two new kinds of features.
- Extend the code you wrote to construct a bag of words to include an additional kind of feature besides words. You can try bigrams, prefixes, parts of speech, anything you like. Describe your new features. [2] Report your new accuracy for both naïve Bayes [1] and logistic regression. [1] Briefly write down your conclusions from this experiment. [1]
- Do the same thing for another kind of feature. [5]
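As one illustration of extending the feature set, word bigrams can simply be appended alongside the unigrams before either classifier counts or weights them. The helper name `featurize` and the `_` joiner are our own conventions:

```python
# Sketch: add word-bigram features alongside the unigram bag of words.
def featurize(tokens):
    """Return unigram + bigram features for one sentence."""
    feats = list(tokens)                                        # unigrams
    feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]  # bigrams
    return feats

print(featurize("not a good movie".split()))
# → ['not', 'a', 'good', 'movie', 'not_a', 'a_good', 'good_movie']
```

Bigrams let the model see short contexts such as "not good" that a pure bag of words cannot distinguish, at the cost of a much larger, sparser feature space.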
Please submit all of the following in a gzipped tar archive (.tar.gz or .tgz; not .zip or .rar) via Sakai:
- A PDF file (not .doc or .docx) with your responses to the instructions/questions above.
- All of the code that you wrote.
- A README file with instructions on how to build and run your code on student*.cse.nd.edu. If this is not possible, please discuss with the instructor before submitting.