Text mining example: spam filtering

March 27, 2014

First principles in text mining

Common aspects of text mining

  • Separate the words (or phrases) in a large body of text
  • Clean up the data by eliminating punctuation, numbers, homogenizing on case, removing non-content words like “The”
  • Create an incidence matrix for words in a document (called a Document-Term Matrix), or some other way of studying word frequency and document similarity. This process is called tokenization.

Packages for handling text data

  • tm = text mining
  • wordcloud
  • RTextTools

Example: SMS spam data

What follows is a study of SMS records used to train a spam filter. The data have been prepared as a .csv file with columns for the type (“spam” or “ham”) and the text of each message.

setwd("/Users/steve/Documents/Computing with Data/19_strings_and_text")
sms_spam_df <- read.csv(file="./data/sms_spam.csv", stringsAsFactors=F)
str(sms_spam_df)
'data.frame':   5559 obs. of  2 variables:
 $ type: chr  "ham" "ham" "ham" "spam" ...
 $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...

Prepare the Corpus

A corpus is a collection of documents.

library(tm)
sms_corpus <- Corpus(VectorSource(sms_spam_df$text))
print(sms_corpus)
A corpus with 5559 text documents

The Corpus function is very flexible: it can also read PDFs, Word documents, and other formats. Here, VectorSource tells Corpus that each entry of the vector is a separate document.

Inspecting the corpus

inspect(sms_corpus[1:3])
A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Hope you are having a good week. Just checking in

[[2]]
K..give back my thanks.

[[3]]
Am also doing in cbe only. But have to pay.

Identify real words

Different texts may contain “Hello!”, “Hello,”, “hello…”, etc. We want to treat all of these as the same word.

We clean up the corpus with the tm_map function.
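Base R alone can illustrate the normalization idea before we apply it corpus-wide. This sketch is not part of the tm pipeline; it just shows that lower-casing plus punctuation removal collapses the variants to one token:

```r
# Base-R sketch of normalization: all three variants collapse to "hello"
variants <- c("Hello!", "Hello,", "hello...")
normalized <- gsub("[[:punct:]]", "", tolower(variants))
unique(normalized)
```

tm's tolower and removePunctuation steps below do the same thing across every document in the corpus.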

Clean corpus, part 1

#translate all letters to lower case
clean_corpus <- tm_map(sms_corpus, tolower)
# remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# remove punctuation
clean_corpus <- tm_map(clean_corpus, removePunctuation)

Clean corpus, part 2

Common practice: remove common non-content words, like to, and, the, … These are called stop words in the lingo. The function stopwords reports a list of about 175 such words.

stopwords()[1:10]
 [1] "i"         "me"        "my"        "myself"    "we"       
 [6] "our"       "ours"      "ourselves" "you"       "your"     
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())

Clean corpus, part 3

Finally, remove the excess white space.

clean_corpus <- tm_map(clean_corpus, stripWhitespace)

Inspect the clean corpus

inspect(clean_corpus[1:3])
A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
hope good week just checking 

[[2]]
kgive back thanks

[[3]]
 also cbe pay

Tokenize the corpus

A token is a single element in a text string, in most cases a word.

sms_dtm <- DocumentTermMatrix(clean_corpus)
inspect(sms_dtm[1:4, 30:35])
A document-term matrix (4 documents, 6 terms)

Non-/sparse entries: 0/24
Sparsity           : 100%
Maximal term length: 10 
Weighting          : term frequency (tf)

    Terms
Docs accenture accept access accessible accidant accident
   1         0      0      0          0        0        0
   2         0      0      0          0        0        0
   3         0      0      0          0        0        0
   4         0      0      0          0        0        0
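The incidence matrix can be built by hand for a toy pair of documents (the two strings here are invented), which shows what DocumentTermMatrix computes at scale:

```r
# Toy document-term matrix in base R (illustration only)
docs <- c("hope good week", "good good day")
tokens <- strsplit(docs, " ")
terms <- sort(unique(unlist(tokens)))   # "day" "good" "hope" "week"
# Count each term's occurrences per document
dtm_toy <- t(sapply(tokens, function(tok) table(factor(tok, levels = terms))))
rownames(dtm_toy) <- paste0("doc", seq_along(docs))
dtm_toy
```

Each row is a document, each column a term, and each cell the count of that term in that document, exactly the structure inspected above.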

Spam or Ham

From here you can use a naive Bayes classifier to build a spam filter based on the words in the message.

Just to visualize the differences, we create word clouds for the two groups.

Indices of spam and ham

The indices of the two message types in the original data carry over to the corpus and the document-term matrix.

spam_indices <- which(sms_spam_df$type == "spam")
spam_indices[1:3]
[1] 4 5 9
ham_indices <- which(sms_spam_df$type == "ham")
ham_indices[1:3]
[1] 1 2 3

Word cloud for ham

library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)

[word cloud plot of ham messages]

Word cloud for spam

wordcloud(clean_corpus[spam_indices], min.freq=40)

[word cloud plot of spam messages]

Building a spam filter

Naive Bayes Classifier

  • This type of “classifier” assigns a probability that a new sample belongs to one class or the other (spam or ham).
  • From the words in the message, and the words not in the message, it computes the probability of spam or of ham.
  • It is based on Bayes' rule, frequency analysis of word occurrences, and an independence assumption (the naive part).

Compute probabilities from training data

Given a word W,

\[ P(spam|W) = \frac{P(W|spam) \times P(spam)}{P(W)} \]

For words \( W_1 \), \( W_2 \) and \( \neg W_3 \), e.g.,

\[ P(spam|W_1 \cap W_2 \cap \neg W_3) = \]

\[ \frac{P(W_1 \cap W_2 \cap \neg W_3|spam) \times P(spam)}{P(W_1 \cap W_2 \cap \neg W_3)} \]

Independence assumption

Assuming the \( W_i \) are independent, this becomes

\[ P(spam|W_1 \cap W_2 \cap \neg W_3) = \]

\[ \frac{P(W_1|spam) \times P(W_2|spam) \times P(\neg W_3|spam) \times P(spam)}{P(W_1)\times P(W_2)\times P(\neg W_3)} \]

The validity of the assumption is less important than how the classifier performs.
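A small worked instance of the formula above makes the computation concrete. All probabilities here are invented for illustration; the real values come from the training data:

```r
# Hypothetical training estimates (made up for illustration)
p_spam <- 0.2                      # P(spam)
p_w_given_spam <- c(0.6, 0.4)      # P(W1|spam), P(W2|spam)
p_w_given_ham  <- c(0.05, 0.1)     # P(W1|ham),  P(W2|ham)

# Unnormalized scores for a message containing W1 and W2,
# using the independence assumption to multiply the word probabilities
score_spam <- prod(p_w_given_spam) * p_spam        # 0.24 * 0.2  = 0.048
score_ham  <- prod(p_w_given_ham) * (1 - p_spam)   # 0.005 * 0.8 = 0.004

# Posterior probability of spam
p_spam_given_w <- score_spam / (score_spam + score_ham)
p_spam_given_w   # about 0.92
```

Note that the denominator P(W1 ∩ W2) never needs to be computed separately: normalizing the two class scores to sum to 1 has the same effect.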

Training the probabilities

  • The probabilities on the right-hand side of the equation are calculated from the training data;
  • The estimated probability on the left-hand side is assigned to new messages and used to classify them;
  • Some test data is set aside to check the accuracy of the classification.

Divide corpus into training and test data

Use 75% of the data for training and 25% for testing.

sms_raw_train <- sms_spam_df[1:4169,]
sms_raw_test <- sms_spam_df[4170:5559,]

and the document-term matrix and clean corpus

sms_dtm_train <- sms_dtm[1:4169,]
sms_dtm_test <- sms_dtm[4170:5559,]
sms_corpus_train <- clean_corpus[1:4169]
sms_corpus_test <- clean_corpus[4170:5559]

Proportions of spam and ham in training and test are similar.

Separate training data to spam and ham

spam <- subset(sms_raw_train, type == "spam")
ham <- subset(sms_raw_train, type == "ham")

Identify frequently used words

Don't muddy the classifier with words that may only occur a few times.

To identify words appearing at least 5 times:

five_times_words <- findFreqTerms(sms_dtm_train, 5)
length(five_times_words)
[1] 1226
five_times_words[1:5]
[1] "abiola" "able"   "abt"    "accept" "access"
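The idea behind findFreqTerms can be sketched in base R: keep the terms whose total counts in the document-term matrix reach the threshold. The counts here are invented for illustration:

```r
# Hypothetical per-term totals (findFreqTerms works from the DTM's counts)
counts <- c(abiola = 6, aardvark = 1, able = 9, abt = 5, accept = 2)
frequent <- names(counts)[counts >= 5]
frequent
```

Dropping rare terms both shrinks the matrix and removes words whose estimated probabilities would rest on only a handful of observations.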

Create document-term matrices using frequent words

sms_train <- DocumentTermMatrix(sms_corpus_train, control=list(dictionary = five_times_words))

sms_test <- DocumentTermMatrix(sms_corpus_test, control=list(dictionary = five_times_words))

Convert count information to "Yes", "No"

Naive Bayes classification needs presence/absence information for each word in a message. We currently have counts of occurrences, so we convert the document-term matrices.

convert_count <- function(x) {
  # 1 if the word occurs at least once, 0 otherwise
  y <- ifelse(x > 0, 1, 0)
  # recode as a "No"/"Yes" factor
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
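A quick check of convert_count on a toy count matrix (the function is repeated here so the snippet runs standalone). Note that apply collapses the per-column factors into a character matrix of "No"/"Yes" values, which is what naiveBayes receives below:

```r
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

m <- matrix(c(0, 2, 1, 0), nrow = 2)   # toy counts: two docs, two terms
converted <- apply(m, 2, convert_count)
converted
```

Any positive count becomes "Yes" and zero becomes "No", discarding how many times a word occurred.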

Convert document-term matrices

sms_train <- apply(sms_train, 2, convert_count)
sms_test <- apply(sms_test, 2, convert_count)

The NaiveBayes function

We'll use a Naive Bayes classifier provided in the package e1071.

library(e1071)

Create a Naive Bayes classifier object

We'll do this on the training data.

sms_classifier <- naiveBayes(sms_train, factor(sms_raw_train$type))
class(sms_classifier)
[1] "naiveBayes"

Evaluate the performance on the test data

  • Given a model or classifier object, there is usually a predict function to test the model on new data.
  • The following predicts the classifications of the messages in the test set, based on the probabilities generated from the training set.

sms_test_pred <- predict(sms_classifier, newdata=sms_test)

Check the predictions against reality

We have predictions and we have a factor of real spam-ham classifications. Generate a table.

table(sms_test_pred, sms_raw_test$type)

sms_test_pred  ham spam
         ham  1202   31
         spam    5  152

Spam filter performance

  • It correctly classified 83% of the spam (152 of 183 spam messages);
  • It correctly classified over 99% of the ham (1202 of 1207 ham messages);
  • This is a good balance between catching spam and not discarding legitimate messages.
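These rates can be recomputed directly from the confusion matrix shown above:

```r
# Confusion matrix from the table above (rows = predicted, cols = actual)
conf <- matrix(c(1202, 5, 31, 152), nrow = 2,
               dimnames = list(predicted = c("ham", "spam"),
                               actual    = c("ham", "spam")))

# Fraction of actual spam caught, and fraction of actual ham kept
spam_recall <- conf["spam", "spam"] / sum(conf[, "spam"])   # 152/183
ham_recall  <- conf["ham",  "ham"]  / sum(conf[, "ham"])    # 1202/1207
round(c(spam = spam_recall, ham = ham_recall), 3)
```

The asymmetry is deliberate in a spam filter: misclassifying ham as spam (here only 5 messages) is far more costly to the user than letting some spam through.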

Flavor of text mining

  • Separate the words (or phrases) in a large body of text
  • Clean up the data by eliminating punctuation, numbers, homogenizing on case, removing non-content words like “The”
  • Create an incidence matrix for words in a document (called a Document-Term Matrix), or some other way of studying word frequency and document similarity. This process is called tokenization.
  • Use frequency of words in documents to classify documents into types or associate documents by similarity