# Text mining example: spam filtering

March 27, 2014

## First principles in text mining

### Common aspects of text mining

• Separate the words (or phrases) in a large body of text
• Clean up the data by eliminating punctuation and numbers, homogenizing case, and removing non-content words like “the”
• Create an incidence matrix for the words in each document (called a Document-Term Matrix), or some other way of studying word frequency and document similarity. This process is called tokenization.
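As a minimal sketch of these steps, here is a toy example in base R. The two documents and all variable names are invented for illustration; the rest of this document does the same work at scale with the tm package.

```r
# Two tiny "documents" (invented for illustration)
docs <- c("The cat sat.", "The dog sat!")

# Clean: lower-case everything and strip punctuation
clean <- gsub("[[:punct:]]", "", tolower(docs))

# Tokenize: split each document on whitespace
tokens <- strsplit(clean, "\\s+")

# Build the incidence (document-term) matrix: one row per document,
# one column per distinct term, 1 if the term occurs in the document
terms <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) as.integer(terms %in% tk)))
colnames(dtm) <- terms
dtm
```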

### Packages for handling text data

• tm = text mining
• wordcloud
• RTextTools

### Example: SMS spam data

Following is a study of SMS records used to train a spam filter. The data have been prepared as a .csv file with columns for the type (“spam” or “ham”) and the text of each message.

setwd("/Users/steve/Documents/Computing with Data/19_strings_and_text")
# read the prepared file into a data frame (the file name here is assumed)
sms_spam_df <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
str(sms_spam_df)

'data.frame':   5559 obs. of  2 variables:
 $ type: chr  "ham" "ham" "ham" "spam" ...
 $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...


### Prepare the Corpus

A corpus is a collection of documents.

library(tm)

sms_corpus <- Corpus(VectorSource(sms_spam_df$text))
print(sms_corpus)

A corpus with 5559 text documents

The Corpus function is very flexible; it can read PDFs, Word documents, and more. Here, VectorSource tells Corpus that each document is one entry in the vector.

### Inspecting the corpus

inspect(sms_corpus[1:3])

A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are: create_date creator
Available variables in the data frame are: MetaID

[[1]] Hope you are having a good week. Just checking in
[[2]] K..give back my thanks.
[[3]] Am also doing in cbe only. But have to pay.

### Identify real words

Different texts may contain “Hello!”, “Hello,”, “hello…”, etc. We would consider all of these the same word. We clean up the corpus with the tm_map function.

### Clean corpus, part 1

# translate all letters to lower case
clean_corpus <- tm_map(sms_corpus, tolower)
# remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# remove punctuation
clean_corpus <- tm_map(clean_corpus, removePunctuation)

### Clean corpus, part 2

Common practice: remove common non-content words, like to, and, the, … These are called stop words in the lingo. The function stopwords returns a list of about 175 such words.

stopwords()[1:10]

 [1] "i"         "me"        "my"        "myself"    "we"
 [6] "our"       "ours"      "ourselves" "you"       "your"

clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())

### Clean corpus, part 3

Finally, remove the excess white space.

clean_corpus <- tm_map(clean_corpus, stripWhitespace)

### Inspect the clean corpus

inspect(clean_corpus[1:3])

A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are: create_date creator
Available variables in the data frame are: MetaID

[[1]] hope good week just checking
[[2]] kgive back thanks
[[3]] also cbe pay

### Tokenize the corpus

A token is a single element in a text string, in most cases a word.
sms_dtm <- DocumentTermMatrix(clean_corpus)
inspect(sms_dtm[1:4, 30:35])

A document-term matrix (4 documents, 6 terms)

Non-/sparse entries: 0/24
Sparsity           : 100%
Maximal term length: 10
Weighting          : term frequency (tf)

    Terms
Docs accenture accept access accessible accidant accident
   1         0      0      0          0        0        0
   2         0      0      0          0        0        0
   3         0      0      0          0        0        0
   4         0      0      0          0        0        0

### Spam or ham

From here we can use a naive Bayes classifier to build a spam filter based on the words in the message. Just to visualize the differences between the two groups, we first create word clouds for each.

### Indices of spam and ham

The indices of the two message types in the original data carry over to the corpus and the document-term matrix.

spam_indices <- which(sms_spam_df$type == "spam")
spam_indices[1:3]

[1] 4 5 9

ham_indices <- which(sms_spam_df$type == "ham")
ham_indices[1:3]

[1] 1 2 3

### Word cloud for ham

library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)

### Word cloud for spam

wordcloud(clean_corpus[spam_indices], min.freq=40)

## Building a spam filter

### Naive Bayes classifier

• This type of “classifier” assigns a probability that a new sample belongs to one class or the other (spam or ham).
• From the words in the message and the words not in the message, compute the probability of spam or ham.
• It is based on Bayes' rule, frequency analysis of word occurrences, and an independence assumption (the “naive” part).

### Compute probabilities from training data

Given a word $W$,

$$P(spam|W) = \frac{P(W|spam) \times P(spam)}{P(W)}$$

For words $W_1$, $W_2$ and $\neg W_3$, e.g.,

$$P(spam|W_1 \cap W_2 \cap \neg W_3) = \frac{P(W_1 \cap W_2 \cap \neg W_3|spam) \times P(spam)}{P(W_1 \cap W_2 \cap \neg W_3)}$$

### Independence assumption

Assuming the $W_i$ are independent, this becomes

$$P(spam|W_1 \cap W_2 \cap \neg W_3) = \frac{P(W_1|spam) \times P(W_2|spam) \times P(\neg W_3|spam) \times P(spam)}{P(W_1)\times P(W_2)\times P(\neg W_3)}$$

The validity of the assumption is less important than how the classifier performs.

### Training the probabilities

• The probabilities on the right-hand side of the equation are estimated from the training data;
• The estimated probability on the left-hand side is assigned to new messages and used to classify them;
• Some test data is set aside to check the accuracy of the classification.

### Divide corpus into training and test data

Use 75% for training and 25% for testing.

sms_raw_train <- sms_spam_df[1:4169,]
sms_raw_test <- sms_spam_df[4170:5559,]

and likewise for the document-term matrix and the clean corpus:

sms_dtm_train <- sms_dtm[1:4169,]
sms_dtm_test <- sms_dtm[4170:5559,]
sms_corpus_train <- clean_corpus[1:4169]
sms_corpus_test <- clean_corpus[4170:5559]

The proportions of spam and ham in the training and test sets are similar.
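The independence formula above can be sanity-checked with a toy calculation. All probabilities below are invented for illustration; they are not estimated from the SMS data.

```r
# Invented training-set probabilities (illustrative only)
p_spam <- 0.2                       # prior P(spam); P(ham) = 1 - p_spam
p_w1_spam <- 0.5; p_w2_spam <- 0.4  # P(W1|spam), P(W2|spam)
p_w1_ham  <- 0.05; p_w2_ham <- 0.1  # P(W1|ham),  P(W2|ham)

# Numerators of Bayes' rule under the independence assumption;
# the shared denominator P(W1) x P(W2) cancels when we normalize
num_spam <- p_w1_spam * p_w2_spam * p_spam
num_ham  <- p_w1_ham  * p_w2_ham  * (1 - p_spam)
posterior_spam <- num_spam / (num_spam + num_ham)
posterior_spam  # about 0.91: two spam-typical words make spam very likely
```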
### Separate training data into spam and ham

spam <- subset(sms_raw_train, type == "spam")
ham <- subset(sms_raw_train, type == "ham")

### Identify frequently used words

Don't muddy the classifier with words that may occur only a few times. To identify the words appearing at least 5 times:

five_times_words <- findFreqTerms(sms_dtm_train, 5)
length(five_times_words)

[1] 1226

five_times_words[1:5]

[1] "abiola" "able"   "abt"    "accept" "access"

### Create document-term matrices using frequent words

sms_train <- DocumentTermMatrix(sms_corpus_train, control=list(dictionary = five_times_words))
sms_test <- DocumentTermMatrix(sms_corpus_test, control=list(dictionary = five_times_words))

### Convert count information to "Yes", "No"

Naive Bayes classification needs present/absent information on each word in a message; we have counts of occurrences. Convert the document-term matrices:

convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0, 1), labels=c("No", "Yes"))
  y
}

### Convert the document-term matrices

sms_train <- apply(sms_train, 2, convert_count)
sms_test <- apply(sms_test, 2, convert_count)

### The naiveBayes function

We'll use the naive Bayes classifier provided in the package e1071.

library(e1071)

### Create a naive Bayes classifier object

We do this on the training data.

sms_classifier <- naiveBayes(sms_train, factor(sms_raw_train$type))
class(sms_classifier)

[1] "naiveBayes"


### Evaluate the performance on the test data

• Given a model or classifier object, there is usually a predict function to test the model on new data.
• The following predicts the classifications of the messages in the test set, based on the probabilities generated from the training set.

sms_test_pred <- predict(sms_classifier, newdata=sms_test)


### Check the predictions against reality

We have predictions and we have a factor of real spam-ham classifications. Generate a table.

table(sms_test_pred, sms_raw_test$type)


sms_test_pred  ham spam
         ham  1202   31
         spam    5  152


### Spam filter performance

• It correctly classified 83% of the spam (152 of 183 spam messages);
• It correctly classified more than 99% of the ham (1202 of 1207 ham messages);
• This is a good balance.
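These figures can be recomputed directly from the confusion matrix above (the numbers are copied from the printed table; rows are predictions, columns are the true labels):

```r
# Confusion matrix from the table above: rows = predicted, cols = true
conf <- matrix(c(1202, 5, 31, 152), nrow = 2,
               dimnames = list(pred = c("ham", "spam"),
                               truth = c("ham", "spam")))

spam_recall <- conf["spam", "spam"] / sum(conf[, "spam"])  # spam caught
ham_recall  <- conf["ham", "ham"]   / sum(conf[, "ham"])   # ham kept
round(c(spam = spam_recall, ham = ham_recall), 3)          # 0.831, 0.996
```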

### Flavor of text mining

• Separate the words (or phrases) in a large body of text
• Clean up the data by eliminating punctuation and numbers, homogenizing case, and removing non-content words like “the”
• Create an incidence matrix for the words in each document (called a Document-Term Matrix), or some other way of studying word frequency and document similarity. This process is called tokenization.
• Use frequency of words in documents to classify documents into types or associate documents by similarity