# Working with text data: transforming and searching

March 20, 2014

## Coding Strings

### Strings versus vectors

• A string is a character vector of length 1
``````a <- "the first string"
length(a)
``````
``````[1] 1
``````
• `nchar` gives the number of characters
``````nchar(a)
``````
``````[1] 16
``````
• the spaces are included in the count

### Special characters

Longer text will contain line feeds, tabs and other non-visible characters.

``````b <- "this is a
multiline sentence"
b
``````
``````[1] "this is a \nmultiline sentence"
``````

R codes the line feed as \n.

• \n = new line
• \r = carriage return
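A quick way to see the difference between the stored escape codes and the rendered characters is to compare `print` with `cat` (a minimal sketch):

```r
# print() shows the escape codes; cat() renders them
s <- "col1\tcol2\nrow2"
print(s)        # displays the \t and \n escape codes
cat(s, "\n")    # renders an actual tab and line break
nchar("\n")     # 1: an escape sequence counts as a single character
```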

### Extracting a substring

• `substr` lets you extract a substring by index
``````substr("This is a string", start=5, stop =8)
``````
``````[1] " is "
``````

## Concatenating vectors (and strings)

### Paste

• Input two vectors
• Concatenate the matching entries, recycling the shorter vector
• Optionally, insert some character between
``````x <- c("S", "T")
y <- 1:4
paste(x,y)
``````
``````[1] "S 1" "T 2" "S 3" "T 4"
``````
``````paste(x, y, sep=":")
``````
``````[1] "S:1" "T:2" "S:3" "T:4"
``````

### Collapsing to a string

• A character vector can be collapsed to a string with `paste`
``````a <- paste(x, y, sep=":")
length(a)
``````
``````[1] 4
``````
``````(b <- paste(a, collapse=","))
``````
``````[1] "S:1,T:2,S:3,T:4"
``````
``````length(b)
``````
``````[1] 1
``````

### Practice

Create a single string containing the “words” S1, S2, S3, S4 with a space between each.
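One possible solution, using recycling and `collapse` (a sketch; other approaches work too):

```r
# paste0 recycles "S" across 1:4, then collapse joins the pieces with spaces
words <- paste0("S", 1:4)            # "S1" "S2" "S3" "S4"
sentence <- paste(words, collapse = " ")
sentence                             # "S1 S2 S3 S4"
length(sentence)                     # 1: a single string
```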

## Finding strings and patterns of strings

### Simplest case: finding substrings

• `regexpr` is the most basic function.
• It finds the location of the first match only
``````a <- "This is a sentence."
regexpr(pattern="i", a)
``````
``````[1] 3
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
``````

### General version of `regexpr`

``````gregexpr(pattern="i", a)
``````
``````[[1]]
[1] 3 6
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
``````

### General version of `regexpr` (cont)

``````gregexpr(pattern="i", c(a,a))
``````
``````[[1]]
[1] 3 6
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 3 6
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
``````

### First pattern

Find the location of digits in

``````p <- "1nce we went 2 a 3-ring circus"
``````
``````gregexpr(pattern="[0-9]", p)
``````
``````[[1]]
[1]  1 14 18
attr(,"match.length")
[1] 1 1 1
attr(,"useBytes")
[1] TRUE
``````

The pattern `[0-9]` is an example of a regular expression.

## Regular expressions: first principles

### Regular expressions: character patterns

• Patterns that describe classes of characters in input strings
• May indicate location (start or end)
• May indicate a number of consecutive matches required
• Commonly used in scripting languages: awk, perl, python, shells

### Basic features

• Lists of alternatives for a single character in [ ]
• Alphanumeric letters match themselves
• Some metacharacters, like ., \, -, $, are used to describe patterns. To match them literally, “quote” them by preceding with \ (or a double \\ in most R settings).
• Classes of characters can be described as ranges like [a-z], [A-Z], [0-9]
• Strings of characters can be written literally: “Th” matches Th.
• Some escaped letters have special meaning: \w denotes a word character, \s denotes “white space”, etc.
• Patterns can be anchored to the start (^) or end ($) of the target string.
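Several of these features can be checked quickly with `grepl`, which returns TRUE or FALSE for a match (a minimal sketch):

```r
grepl("^Th", "This")       # anchored at the start: TRUE
grepl("ing$", "3-ring")    # anchored at the end: TRUE
grepl("\\.", "end.")       # escaped metacharacter matches a literal dot: TRUE
grepl("[a-z]", "ABC")      # character range: FALSE, no lowercase letters
```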

### Examples of regular expression use

``````p
``````
``````[1] "1nce we went 2 a 3-ring circus"
``````

Match “we”.

``````gregexpr(pattern="we", p)[[1]]
``````
``````[1] 6 9
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
``````

### Examples (cont)

``````gregexpr(pattern="[we]", p)[[1]]
``````
``````[1]  4  6  7  9 10
attr(,"match.length")
[1] 1 1 1 1 1
attr(,"useBytes")
[1] TRUE
``````

This matched occurrences of “w” or “e”.

### Examples (cont)

Match digits followed by white space.

``````# "[0-9]\s"
# Error: '\s' is an unrecognized escape in character string starting ""[0-9]\s"
``````
``````"[0-9]\\s"
``````
``````[1] "[0-9]\\s"
``````
``````gregexpr(pattern="[0-9]\\s", p)
``````
``````[[1]]
[1] 14
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
``````

### Examples (cont)

Get the location of a word starting with c, then extract the rest of p from there. Note that because the pattern requires a preceding space, a word at the very start of the string would be skipped.

``````loc <- gregexpr(pattern="\\sc", p)[[1]]
loc
``````
``````[1] 24
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
``````
``````substr(p, start=loc+1, stop=nchar(p))
``````
``````[1] "circus"
``````

### Regular expression references

Regular expression syntax has many more features than shown here. Don't try to reconstruct it from memory; use the references.

### Practice

Count the words in

``````aa <- "There are 7 words in this sentence."
``````
``````spaces <- gregexpr(pattern="\\s", aa)[[1]]
length(spaces) + 1
``````
``````[1] 7
``````
``````wrds <- gregexpr(pattern="\\w+", aa)[[1]]
wrds
``````
``````[1]  1  7 11 13 19 22 27
attr(,"match.length")
[1] 5 3 1 5 2 4 8
attr(,"useBytes")
[1] TRUE
``````

## Substitutions

### sub and gsub

Identify a pattern and replace it with another string.

``````gsub(pattern="\\s+", replacement="", x="  words  ")
``````
``````[1] "words"
``````

Other similar functions:

`toupper` and `tolower`
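A quick sketch of the difference between `sub` (replaces the first match only) and `gsub` (replaces all matches), along with the case functions:

```r
s <- "one fish two fish"
sub(pattern = "fish", replacement = "bird", x = s)   # first match only
gsub(pattern = "fish", replacement = "bird", x = s)  # every match
toupper(s)          # "ONE FISH TWO FISH"
tolower("MiXeD")    # "mixed"
```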

## Finding entries in character vectors matching patterns

### Simple matching

``````s <- c("This", "vector", "is", "his")
``````

Find “his”.

``````"his" %in% s
``````
``````[1] TRUE
``````
``````# It is there
s == "his"
``````
``````[1] FALSE FALSE FALSE  TRUE
``````
``````# Only one true
``````

### Simple matching (cont)

``````which(s == "his")
``````
``````[1] 4
``````
``````# The fourth entry
``````

### grep

How do we find all entries containing a t?

This is what `grep` can do.

``````grep(pattern="[Tt]", s)
``````
``````[1] 1 2
``````

You can get the entries instead of the indexes by

``````grep(pattern="[Tt]", s, value=TRUE)
``````
``````[1] "This"   "vector"
``````

### More `grep`

Find the entries that start with a capital letter.

``````grep(pattern="^[A-Z]", s, value=TRUE)
``````
``````[1] "This"
``````

### What `grep` does

`grep` reports which entries match; it doesn't tell you where inside an entry the match occurs. For that you use `gregexpr` and deal with the list output.
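An alternative to digging through `gregexpr`'s position list is `regmatches`, a base R function that extracts the matched substrings themselves (a sketch):

```r
s <- c("This", "vector", "is", "his")
# grep: which entries match
grep(pattern = "[Tt]", s)                    # 1 2
# regmatches + gregexpr: the matched text inside each entry
regmatches(s, gregexpr("[Tt]", s))           # "T" for entry 1, "t" for entry 2,
                                             # character(0) for entries with no match
```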

### `grep` practice

Extract the entries containing a period (`.`). Note this is a metacharacter that must be escaped.

``````bb <- c("Once", "the", "3rd", "**alarm**", "sounds", "...")
``````
``````grep("\\.", bb, value=T)
``````
``````[1] "..."
``````

## Splitting strings on patterns

### strsplit

This function splits a string or a vector of strings on a regular expression. For a single string it returns a character vector of the substrings separated by the pattern.

``````cc <- "This is
a
string"
``````
``````lines <- strsplit(x=cc, split="\\n")
lines
``````
``````[[1]]
[1] "This is " "a "       " string"
``````

Note: `strsplit` always returns a list, even if the input is just a vector of length 1.

### Uses of strsplit

• Splitting text into a vector of words
• Splitting text into lines
• Separating values from variable labels in some representations of data
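For instance, splitting text into a vector of words just takes `strsplit` plus `unlist` to flatten the list (a minimal sketch):

```r
txt <- "Split this sentence into words"
# \\s+ treats any run of white space as one separator
words <- unlist(strsplit(txt, split = "\\s+"))
words            # "Split" "this" "sentence" "into" "words"
length(words)    # 5
```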

### Setting up data

``````records <- "sample: S15a, height: 189, weight: 89, date: 3/1/14
sample: S11, height: 185, weight: 92, date: 2/4/14
sample: S16b, height: 175, weight: 70, date: 3/1/14
"
``````

There are 3 records separated by new lines.

• Separate the records into entries in a vector
• For each record, separate the “variable: value” pairs
• Isolate the values, store them as records of a data.frame

### Creating records

``````records_sep <- strsplit(x=records, "\\n")
records_sep
``````
``````[[1]]
[1] "sample: S15a, height: 189, weight: 89, date: 3/1/14"
[2] "sample: S11, height: 185, weight: 92, date: 2/4/14"
[3] "sample: S16b, height: 175, weight: 70, date: 3/1/14"
``````
``````records_sep <- records_sep[[1]]
``````

### Separate the data for each variable

``````records1_L <- strsplit(records_sep, ",\\s?")
length(records1_L)
``````
``````[1] 3
``````
``````records1_L[[1]]
``````
``````[1] "sample: S15a" "height: 189"  "weight: 89"   "date: 3/1/14"
``````

### Separate variable names from values

``````records_LL <- lapply(records1_L, function(r) {
strsplit(r, ":\\s?")
})
records_LL[[1]]
``````
``````[[1]]
[1] "sample" "S15a"

[[2]]
[1] "height" "189"

[[3]]
[1] "weight" "89"

[[4]]
[1] "date"   "3/1/14"
``````

### Collapse the values into a vector - Example

Do it for the data from one record.

``````sapply(records_LL[[1]], function(x) x[2])
``````
``````[1] "S15a"   "189"    "89"     "3/1/14"
``````

### Collapse the values into a vector

``````records_values_L <- lapply(records_LL, function(r_L) {
sapply(r_L, function(x) x[2])
})
records_values_L
``````
``````[[1]]
[1] "S15a"   "189"    "89"     "3/1/14"

[[2]]
[1] "S11"    "185"    "92"     "2/4/14"

[[3]]
[1] "S16b"   "175"    "70"     "3/1/14"
``````

### Pack into data.frame

``````library(plyr)
dat <- ldply(records_values_L, function(ls) ls)
dat
``````
``````    V1  V2 V3     V4
1 S15a 189 89 3/1/14
2  S11 185 92 2/4/14
3 S16b 175 70 3/1/14
``````
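If you prefer not to depend on plyr, the same data.frame can be built in base R with `do.call` and `rbind` (a sketch; `records_values_L` is recreated here so the example is self-contained):

```r
# the list of per-record value vectors built above
records_values_L <- list(
  c("S15a", "189", "89", "3/1/14"),
  c("S11",  "185", "92", "2/4/14"),
  c("S16b", "175", "70", "3/1/14")
)
# rbind the value vectors into a 3 x 4 character matrix, then convert
mat <- do.call(rbind, records_values_L)
dat2 <- as.data.frame(mat, stringsAsFactors = FALSE)
dat2
```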

## First principles in text mining

### Common aspects of text mining

• Separate the words (or phrases) in a large body of text
• Clean up the data by eliminating punctuation, numbers, homogenizing on case, removing non-content words like “The”
• Create an incidence matrix of word counts per document (called a Document-Term Matrix), or some other structure for studying word frequency and document similarity. Splitting the text into words is called tokenization.
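The idea of a document-term matrix can be sketched in base R on a toy example before reaching for a package (the tm package below handles this at scale):

```r
docs <- c("good week good", "pay now")
# tokenize each document into words
tokens <- strsplit(docs, split = "\\s+")
# the vocabulary: all distinct terms across the documents
terms <- sort(unique(unlist(tokens)))
# count each term per document; rows = documents, columns = terms
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = terms))))
dtm
```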

### Packages for handling text data

• tm = text mining
• wordcloud
• RTextTools

### Example: SMS spam data

Following is a study of SMS records used to train a spam filter. The data have been prepared as a .csv file with columns for type (“spam” or “ham”) and the text of the message.

``````setwd("/Users/steve/Documents/Computing with Data/19_strings_and_text")
sms_spam_df <- read.csv(file="./data/sms_spam.csv", stringsAsFactors=F)
str(sms_spam_df)
``````
``````'data.frame':   5559 obs. of  2 variables:
$ type: chr  "ham" "ham" "ham" "spam" ...
$ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...
``````

### Prepare the Corpus

A corpus is a collection of documents.

``````library(tm)
``````
sms_corpus <- Corpus(VectorSource(sms_spam_df$text))
print(sms_corpus)
``````
``````A corpus with 5559 text documents
``````

The `Corpus` function is very flexible; it can read many sources, e.g. PDFs and Word documents. Here, `VectorSource` tells `Corpus` that each document is an entry in the vector.

### Inspecting the corpus

``````inspect(sms_corpus[1:3])
``````
``````A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID

[[1]]
Hope you are having a good week. Just checking in

[[2]]
K..give back my thanks.

[[3]]
Am also doing in cbe only. But have to pay.
``````

### Identify real words

Different texts may contain “Hello!”, “Hello,”, “hello…”, etc. We would consider all of these to be the same word.

We clean up the corpus with the `tm_map` function.

### Clean corpus, part 1

``````#translate all letters to lower case
clean_corpus <- tm_map(sms_corpus, tolower)
# remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# remove punctuation
clean_corpus <- tm_map(clean_corpus, removePunctuation)
``````

### Clean corpus, part 2

Common practice: remove frequent non-content words, like to, and, the, … These are called stop words in the lingo. The function `stopwords` returns a vector of about 175 such words.

``````stopwords()[1:10]
``````
`````` [1] "i"         "me"        "my"        "myself"    "we"
[6] "our"       "ours"      "ourselves" "you"       "your"
``````
``````clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())
``````

### Clean corpus, part 3

Finally, remove the excess white space.

``````clean_corpus <- tm_map(clean_corpus, stripWhitespace)
``````

### Inspect the clean corpus

``````inspect(clean_corpus[1:3])
``````
``````A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID

[[1]]
hope good week just checking

[[2]]
kgive back thanks

[[3]]
also cbe pay
``````

### Tokenize the corpus

A token is a single element in a text string, in most cases a word.

``````sms_dtm <- DocumentTermMatrix(clean_corpus)
sms_dtm
``````
``````A document-term matrix (5559 documents, 8210 terms)

Non-/sparse entries: 43726/45595664
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)
``````

### Spam or Ham

From here you can use a naive Bayes classifier to build a spam filter based on the words in the message.

Just to visualize the differences, we create word clouds for the two groups.

### Indices of spam and ham

The indices of the two message types in the original data carry over to the corpus and doc term matrix.

spam_indices <- which(sms_spam_df$type == "spam")
spam_indices[1:3]
``````
``````[1] 4 5 9
``````
ham_indices <- which(sms_spam_df$type == "ham")
ham_indices[1:3]
``````
``````[1] 1 2 3
``````

### Word cloud for ham

``````library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)
``````

### Word cloud for spam

``````wordcloud(clean_corpus[spam_indices], min.freq=40)
``````