March 27, 2014
What follows is a study of SMS records used to train a spam filter. The data have been prepared as a .csv file with columns for the message type (“spam” or “ham”) and the text of the message.
setwd("/Users/steve/Documents/Computing with Data/19_strings_and_text")
sms_spam_df <- read.csv(file="./data/sms_spam.csv", stringsAsFactors=F)
str(sms_spam_df)
'data.frame': 5559 obs. of 2 variables:
$ type: chr "ham" "ham" "ham" "spam" ...
$ text: chr "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...
A corpus is a collection of documents.
library(tm)
sms_corpus <- Corpus(VectorSource(sms_spam_df$text))
print(sms_corpus)
A corpus with 5559 text documents
The Corpus function is very flexible: it can read PDFs, Word documents, and more. Here, VectorSource told the Corpus function that each document was an entry in the vector.
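As an aside, if the documents lived in a folder of plain-text files rather than in a data frame column, DirSource could stand in for VectorSource. A minimal sketch, assuming a hypothetical directory ./data/docs of .txt files:
# hypothetical: each .txt file in ./data/docs becomes one document
docs_corpus <- Corpus(DirSource("./data/docs"))
print(docs_corpus)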
inspect(sms_corpus[1:3])
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Hope you are having a good week. Just checking in
[[2]]
K..give back my thanks.
[[3]]
Am also doing in cbe only. But have to pay.
Different texts may contain “Hello!”, “Hello,” “hello…”, etc. We would consider all of these to be the same word.
We clean up the corpus with the tm_map function.
# translate all letters to lower case
clean_corpus <- tm_map(sms_corpus, tolower)
# remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# remove punctuation
clean_corpus <- tm_map(clean_corpus, removePunctuation)
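Note: in newer versions of tm (0.6 and later), base R functions such as tolower must be wrapped in content_transformer before being passed to tm_map; removeNumbers and removePunctuation are tm's own transformations and work as-is. For example:
# in tm >= 0.6, wrap base R functions with content_transformer
clean_corpus <- tm_map(sms_corpus, content_transformer(tolower))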
Common practice: remove common non-content words, like to, and, the, … These are called stop words in the lingo. The stopwords function returns a list of about 175 such words.
stopwords()[1:10]
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())
Finally, remove the excess white space.
clean_corpus <- tm_map(clean_corpus, stripWhitespace)
inspect(clean_corpus[1:3])
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
hope good week just checking
[[2]]
kgive back thanks
[[3]]
also cbe pay
A token is a single element in a text string, in most cases a word. A document-term matrix has one row per document and one column per term, and each entry counts how many times that term occurs in that document.
sms_dtm <- DocumentTermMatrix(clean_corpus)
inspect(sms_dtm[1:4, 30:35])
A document-term matrix (4 documents, 6 terms)
Non-/sparse entries: 0/24
Sparsity : 100%
Maximal term length: 10
Weighting : term frequency (tf)
Terms
Docs accenture accept access accessible accidant accident
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
From here you can use a naive Bayes classifier to build a spam filter based on the words in the message.
Just to visualize the differences, we create word clouds for the two groups.
The indices of the two message types in the original data carry over to the corpus and doc term matrix.
spam_indices <- which(sms_spam_df$type == "spam")
spam_indices[1:3]
[1] 4 5 9
ham_indices <- which(sms_spam_df$type == "ham")
ham_indices[1:3]
[1] 1 2 3
library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)
wordcloud(clean_corpus[spam_indices], min.freq=40)
Given a word W,
\[ P(spam|W) = \frac{P(W|spam) \times P(spam)}{P(W)} \]
For example, for a message containing words \( W_1 \) and \( W_2 \) but not \( W_3 \),
\[ P(spam|W_1 \cap W_2 \cap \neg W_3) = \frac{P(W_1 \cap W_2 \cap \neg W_3|spam) \times P(spam)}{P(W_1 \cap W_2 \cap \neg W_3)} \]
Assuming the \( W_i \) are independent, this becomes
\[ P(spam|W_1 \cap W_2 \cap \neg W_3) = \frac{P(W_1|spam) \times P(W_2|spam) \times P(\neg W_3|spam) \times P(spam)}{P(W_1)\times P(W_2)\times P(\neg W_3)} \]
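For a quick sanity check with invented numbers: suppose a word \( W \) appears in 20% of spam messages, spam makes up 13% of all messages, and \( W \) appears in 5% of all messages. Then
\[ P(spam|W) = \frac{0.2 \times 0.13}{0.05} = 0.52 \]
so seeing \( W \) raises the spam probability from 13% to 52%.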
How well this independence assumption holds matters less than how the classifier actually performs.
Use 75% of the messages for training and 25% for testing.
sms_raw_train <- sms_spam_df[1:4169,]
sms_raw_test <- sms_spam_df[4170:5559,]
and the document-term matrix and clean corpus
sms_dtm_train <- sms_dtm[1:4169,]
sms_dtm_test <- sms_dtm[4170:5559,]
sms_corpus_train <- clean_corpus[1:4169]
sms_corpus_test <- clean_corpus[4170:5559]
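This sequential split assumes the messages are already in random order. If they weren't, a randomized split would be safer; a minimal sketch with hypothetical names:
# hypothetical alternative: draw a random 75% of the rows for training
set.seed(123)  # for a reproducible split
train_idx <- sample(nrow(sms_spam_df), floor(0.75 * nrow(sms_spam_df)))
random_train <- sms_spam_df[train_idx, ]
random_test <- sms_spam_df[-train_idx, ]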
Proportions of spam and ham in training and test are similar.
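A quick check with base R's prop.table prints the class proportions for each set:
prop.table(table(sms_raw_train$type))
prop.table(table(sms_raw_test$type))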
spam <- subset(sms_raw_train, type == "spam")
ham <- subset(sms_raw_train, type == "ham")
Don't muddy the classifier with words that may only occur a few times.
To identify words appearing at least 5 times:
five_times_words <- findFreqTerms(sms_dtm_train, 5)
length(five_times_words)
[1] 1226
five_times_words[1:5]
[1] "abiola" "able" "abt" "accept" "access"
sms_train <- DocumentTermMatrix(sms_corpus_train, control=list(dictionary = five_times_words))
sms_test <- DocumentTermMatrix(sms_corpus_test, control=list(dictionary = five_times_words))
Naive Bayes classification needs present-or-absent information on each word in a message, but we have counts of occurrences. Convert the document-term matrices.
convert_count <- function(x) {
  # recode counts: 0 stays 0, any positive count becomes 1
  y <- ifelse(x > 0, 1, 0)
  # turn the 0/1 codes into a No/Yes factor
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
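For example:
convert_count(c(0, 1, 3))
[1] No  Yes Yes
Levels: No Yes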
sms_train <- apply(sms_train, 2, convert_count)
sms_test <- apply(sms_test, 2, convert_count)
We'll use a Naive Bayes classifier provided in the package e1071.
library(e1071)
We train the classifier on the training data.
sms_classifier <- naiveBayes(sms_train, factor(sms_raw_train$type))
class(sms_classifier)
[1] "naiveBayes"
Use the predict function to test the model on new data.
sms_test_pred <- predict(sms_classifier, newdata=sms_test)
We have predictions, and we have a factor of the true spam/ham classifications. Compare them in a table.
table(sms_test_pred, sms_raw_test$type)
sms_test_pred ham spam
ham 1202 31
spam 5 152
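From the table, (1202 + 152)/1390 ≈ 97.4% of the test messages are classified correctly. The same figure can be computed directly:
# proportion of test messages where the prediction matches the true type
mean(sms_test_pred == sms_raw_test$type)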