Working with text data: transforming and searching

March 20, 2014

Coding Strings

Strings versus vectors

  • A string is a character vector of length 1
a <- "the first string"
length(a)
[1] 1
  • nchar gives the number of characters
nchar(a)
[1] 16
  • The spaces are included in the count
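
The difference between length and nchar is easiest to see on a vector with more than one entry. A minimal sketch; the vector v is a made-up example, not part of the original slides:

```r
# v is a hypothetical character vector holding three strings
v <- c("a", "the first string", "")

length(v)  # number of strings in the vector: 3
nchar(v)   # number of characters in each string: 1 16 0
```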

Special characters

Longer text will contain line feeds, tabs and other non-visible characters.

b <- "this is a 
multiline sentence"
b
[1] "this is a \nmultiline sentence"

R codes the line feed as \n.

  • \n = new line
  • \r = carriage return
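
A short sketch of how the escape behaves, assuming the same string b as above written with an explicit \n. print shows the escaped form, cat writes the actual line break, and the escape counts as a single character:

```r
b <- "this is a \nmultiline sentence"

print(b)     # shows the escape sequence \n inside the quotes
cat(b)       # writes the text with a real line break

nchar("\n")  # the escape is one character: 1
nchar(b)     # 10 + 1 + 18 characters: 29
```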

Extracting a substring

  • substr lets you extract a substring by index
substr("This is a string", start=5, stop =8)
[1] " is "
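
Two further properties of substr, sketched here with made-up inputs: it is vectorized over its first argument, and it has an assignment form that overwrites characters in place.

```r
# words is a hypothetical vector; substr works on every entry
words <- c("alpha", "beta", "gamma")
substr(words, start=1, stop=2)   # "al" "be" "ga"

# the assignment form replaces the indexed characters
s <- "This is a string"
substr(s, start=1, stop=4) <- "That"
s                                # "That is a string"
```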

Concatenating vectors (and strings)

  • paste takes two (or more) vectors as input
  • It concatenates the matching entries, recycling the shorter vector
  • Optionally, it inserts a separator between them (a space by default)
x <- c("S", "T")
y <- 1:4
paste(x, y)
[1] "S 1" "T 2" "S 3" "T 4"
paste(x, y, sep=":")
[1] "S:1" "T:2" "S:3" "T:4"

Collapsing to a string

  • A character vector can be collapsed to a string with paste
a <- paste(x, y, sep=":")
length(a)
[1] 4
(b <- paste(a, collapse=","))
[1] "S:1,T:2,S:3,T:4"
length(b)
[1] 1


Create a string with “words” S1, S2, S3, S4 with space between each.
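
One possible solution to the exercise, as a sketch. The variable name words is just for illustration; paste0 is paste with sep="".

```r
# build the four "words", then collapse them with a space
words <- paste0("S", 1:4)              # "S1" "S2" "S3" "S4"
one_string <- paste(words, collapse=" ")
one_string                             # "S1 S2 S3 S4"

# the same result in a single call
paste("S", 1:4, sep="", collapse=" ")  # "S1 S2 S3 S4"
```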

Finding strings and patterns of strings

Simplest case: finding substrings

  • regexpr is the most basic.
  • It finds the location of the first match. The output shows the position, the match length, and the useBytes attribute.
a <- "This is a sentence."
regexpr(pattern="i", a)
[1] 3
[1] 1
[1] TRUE

General version of `regexpr`

gregexpr finds every match, not just the first. It returns a list with one entry per input string.

gregexpr(pattern="i", a)
[1] 3 6
[1] 1 1
[1] TRUE

General version of `regexpr` (cont)

With a vector of two strings, the list has two entries.

gregexpr(pattern="i", c(a,a))
[1] 3 6
[1] 1 1
[1] TRUE

[1] 3 6
[1] 1 1
[1] TRUE

First pattern

Find the location of digits in

p <- "1nce we went 2 a 3-ring circus"
gregexpr(pattern="[0-9]", p)
[1]  1 14 18
[1] 1 1 1
[1] TRUE
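
The positions and match lengths can be pulled out of the list and used to extract the matched text. A minimal sketch using base R only; the names m, pos and len are chosen here for illustration:

```r
p <- "1nce we went 2 a 3-ring circus"
m <- gregexpr(pattern="[0-9]", p)

pos <- as.integer(m[[1]])            # start positions: 1 14 18
len <- attr(m[[1]], "match.length")  # length of each match: 1 1 1

# extract by index ...
digits <- substring(p, first=pos, last=pos + len - 1)
digits                               # "1" "2" "3"

# ... or let regmatches do the same bookkeeping
regmatches(p, m)[[1]]                # "1" "2" "3"
```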

The pattern "[0-9]" is an example of a regular expression.

Regular expressions: first principles

Regular expressions: character patterns

  • Patterns to describe classes of characters in input strings S
  • May indicate location (start or end)
  • May indicate a number of consecutive matches required
  • Commonly used in scripting languages: awk, perl, python, shells

Basic features

  • Lists of alternatives for a single character in [ ]
  • Alphanumeric letters match themselves
  • Some metacharacters like ., \, -, $, are used to describe patterns. To literally match them, “quote” them by preceding with \ (or double-\ in most R settings).
  • Classes of characters can be described as ranges like [a-z], [A-Z], [0-9]
  • Strings of characters can be written literally: “Th” matches Th.
  • Some escaped letters have special meaning: \w denotes a word character, \s denotes “white space”, etc.
  • Patterns can be anchored to the start ^ or end $ of the target string.
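
The features in the list above can be tried out with grepl, which returns TRUE or FALSE for each entry. A small sketch; the test vector x is invented for the purpose:

```r
# x is a hypothetical vector chosen to exercise each feature
x <- c("Then", "then.", "3 cats", "a-z", "end$")

grepl("Th", x)      # literal characters: only "Then"
grepl("[0-9]", x)   # character range: only "3 cats"
grepl("^t", x)      # anchored to the start: only "then."
grepl("\\.$", x)    # escaped . anchored to the end: only "then."
grepl("\\s", x)     # white space: only "3 cats"
grepl("\\$", x)     # escaped $ matched literally: only "end$"
```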

Examples of regular expression use

p
[1] "1nce we went 2 a 3-ring circus"

Match “we”.

gregexpr(pattern="we", p)[[1]]
[1] 6 9
[1] 2 2
[1] TRUE

Examples (cont)

gregexpr(pattern="[we]", p)[[1]]
[1]  4  6  7  9 10
[1] 1 1 1 1 1
[1] TRUE

This matched occurrences of “w” or “e”.

Examples (cont)

Match digits followed by white space.

# "[0-9]\s"
# Error: '\s' is an unrecognized escape in character string starting ""[0-9]\s"
# The backslash itself must be escaped in an R string
"[0-9]\\s"
[1] "[0-9]\\s"
gregexpr(pattern="[0-9]\\s", p)
[1] 14
[1] 2
[1] TRUE
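
Why the doubled backslash works: the R parser turns \\ into a single backslash, so the regular expression engine receives \s. A sketch that makes this visible (the name pat is illustrative):

```r
p <- "1nce we went 2 a 3-ring circus"
pat <- "[0-9]\\s"

print(pat)      # printed in escaped form, with the doubled backslash
cat(pat, "\n")  # the characters actually stored: [0-9]\s
nchar(pat)      # 7 characters, not 8

as.integer(gregexpr(pattern=pat, p)[[1]])  # 14

# In R 4.0.0 and later a raw string, r"(...)", avoids the doubling.
```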

Examples (cont)

Find a word starting with c and extract the remainder of p from there. Because the pattern requires white space before the c, a match in the first word would be skipped.

loc <- gregexpr(pattern="\\sc", p)[[1]]
[1] 24
[1] 2
[1] TRUE
substr(p, start=loc+1, stop=nchar(p))
[1] "circus"

Regular expression references

Don't try this on your own. Use the references.


Count the words in

aa <- "There are 7 words in this sentence."
spaces <- gregexpr(pattern="\\s", aa)[[1]]
length(spaces) + 1
[1] 7
wrds <- gregexpr(pattern="\\w+", aa)[[1]]
wrds
[1]  1  7 11 13 19 22 27
[1] 5 3 1 5 2 4 8
[1] TRUE
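
The two counting methods agree here, but they differ once the text has repeated spaces. A sketch comparing them; the string messy is a made-up example:

```r
aa <- "There are 7 words in this sentence."
wrds <- gregexpr(pattern="\\w+", aa)[[1]]
length(wrds)                               # 7 matches, one per word
regmatches(aa, gregexpr("\\w+", aa))[[1]]  # the words themselves

# counting single white space characters over-counts with repeated spaces
messy <- "Two  spaces   here"
n_by_spaces <- length(gregexpr("\\s", messy)[[1]]) + 1  # 6
n_by_words <- length(gregexpr("\\w+", messy)[[1]])      # 3
```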


sub and gsub

Identify a pattern and replace it with another string.

gsub(pattern="\\s+", replacement="", x="  words  ")
[1] "words"

Other similar functions:

toupper and tolower
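
A short sketch of the difference between the two: sub replaces only the first match, gsub replaces all of them. The inputs are illustrative.

```r
x <- "  words  "
first_only <- sub(pattern="\\s+", replacement="", x=x)    # "words  "
all_matches <- gsub(pattern="\\s+", replacement="", x=x)  # "words"

# trimming only the ends keeps the inner space
trimmed <- gsub("^\\s+|\\s+$", "", "  two words  ")       # "two words"

toupper("words")  # "WORDS"
tolower("Words")  # "words"
```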

Finding entries in character vectors matching patterns

Simple matching

s <- c("This", "vector", "is", "his")

Find “his”.

"his" %in% s
[1] TRUE
# It is there
s == "his"
[1] FALSE FALSE FALSE  TRUE
# Only one true

Simple matching (cont)

which(s == "his")
[1] 4
# The fourth entry


How do we find all entries containing a t?

This is what grep can do.

grep(pattern="[Tt]", s)
[1] 1 2

You can get the entries instead of the indexes by

grep(pattern="[Tt]", s, value=TRUE)
[1] "This"   "vector"

More `grep`

Find the entries that start with a capital letter.

grep(pattern="^[A-Z]", s, value=TRUE)
[1] "This"

What `grep` does

grep reports which entries contain a match; it doesn't tell you where inside the entry the match occurs. For that, use regexpr or gregexpr and work with their output.
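
A side-by-side sketch on the vector s: grep gives the indices of the matching entries, grepl gives one logical per entry, and regexpr gives the position of the first match inside each entry (-1 for no match).

```r
s <- c("This", "vector", "is", "his")

grep("[Tt]", s)                  # 1 2
grepl("[Tt]", s)                 # TRUE TRUE FALSE FALSE
as.integer(regexpr("[Tt]", s))   # 1 4 -1 -1
```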

`grep` practice

Extract the entries containing a period ("."). Note that . is a metacharacter and must be escaped.

bb <- c("Once", "the", "3rd", "**alarm**", "sounds", "...")
grep("\\.", bb, value=TRUE)
[1] "..."

Splitting strings on patterns


strsplit splits a string or character vector by a regular expression. For a single string it returns a list whose one entry is a character vector of the substrings separated by the pattern.

cc <- "This is \na \n string"
lines <- strsplit(x=cc, split="\\n")
lines[[1]]
[1] "This is " "a "       " string"

Note: You always get a list returned even if the input is just a vector of length 1.

Uses of strsplit

  • Splitting text into a vector of words
  • Splitting text into lines
  • Separating values from variable labels in some representations of data
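
The first use in the list, plus the list-valued return for vector input, can be sketched as follows. The strings are invented examples.

```r
# split one sentence into words; [[1]] takes the vector out of the list
sentence <- "Split this into words"
word_vec <- strsplit(sentence, split="\\s+")[[1]]
word_vec               # "Split" "this" "into" "words"

# with a vector input there is one list entry per input string
two <- c("a b", "c d e")
parts <- strsplit(two, split=" ")
length(parts)          # 2
sapply(parts, length)  # 2 3
```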

Setting up data

records <- "sample: S15a, height: 189, weight: 89, date: 3/1/14
sample: S11, height: 185, weight: 92, date: 2/4/14
sample: S16b, height: 175, weight: 70, date: 3/1/14"

There are 3 records separated by new lines.

  • Separate the records into entries in a vector
  • For each record, separate the “variable: value” pairs
  • Isolate the values, store them as records of a data.frame

Creating records

records_sep <- strsplit(x=records, "\\n")
[1] "sample: S15a, height: 189, weight: 89, date: 3/1/14"
[2] "sample: S11, height: 185, weight: 92, date: 2/4/14" 
[3] "sample: S16b, height: 175, weight: 70, date: 3/1/14"
records_sep <- records_sep[[1]]

Separate the data for each variable

records1_L <- strsplit(records_sep, ",\\s?")
length(records1_L)
[1] 3
records1_L[[1]]
[1] "sample: S15a" "height: 189"  "weight: 89"   "date: 3/1/14"

Separate variable names from values

records_LL <- lapply(records1_L, function(r) {
  strsplit(r, ":\\s?")
})
records_LL[[1]]
[1] "sample" "S15a"  

[1] "height" "189"   

[1] "weight" "89"    

[1] "date"   "3/1/14"

Collapse the values into a vector - Example

Do it for the data from one record.

sapply(records_LL[[1]], function(x) x[2])
[1] "S15a"   "189"    "89"     "3/1/14"

Collapse the values into a vector

records_values_L <- lapply(records_LL, function(r_L) {
  sapply(r_L, function(x) x[2])
})
records_values_L
[1] "S15a"   "189"    "89"     "3/1/14"

[1] "S11"    "185"    "92"     "2/4/14"

[1] "S16b"   "175"    "70"     "3/1/14"

Pack into data.frame

library(plyr)  # provides ldply
dat <- ldply(records_values_L, function(ls) ls)
dat
    V1  V2 V3     V4
1 S15a 189 89 3/1/14
2  S11 185 92 2/4/14
3 S16b 175 70 3/1/14
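
The data.frame above has generic column names and every column is character. A base R sketch of the same pipeline that also recovers the variable names and converts the numeric columns; it swaps ldply for do.call(rbind, ...), and the names recs, pairs, values and var_names are chosen here for illustration:

```r
records <- "sample: S15a, height: 189, weight: 89, date: 3/1/14
sample: S11, height: 185, weight: 92, date: 2/4/14
sample: S16b, height: 175, weight: 70, date: 3/1/14"

# records -> "variable: value" pairs -> (name, value) vectors
recs <- strsplit(records, "\\n")[[1]]
pairs <- lapply(strsplit(recs, ",\\s?"), function(r) strsplit(r, ":\\s?"))

# second element of each pair is the value, first is the variable name
values <- lapply(pairs, function(r_L) sapply(r_L, function(x) x[2]))
var_names <- sapply(pairs[[1]], function(x) x[1])

# stack the value vectors as rows of a character matrix
dat <- as.data.frame(do.call(rbind, values), stringsAsFactors=FALSE)
names(dat) <- var_names

# the values are still character; convert the numeric columns
dat$height <- as.numeric(dat$height)
dat$weight <- as.numeric(dat$weight)
dat
```

The date column is left as character here, since the slides do not say whether 3/1/14 is month/day or day/month.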

First principles in text mining

Common aspects of text mining

  • Separate the words (or phrases) in a large body of text
  • Clean up the data by eliminating punctuation, numbers, homogenizing on case, removing non-content words like “The”
  • Create an incidence matrix of word counts per document (called a Document-Term Matrix), or some other representation for studying word frequency and document similarity. Splitting the text into the individual words that get counted is called tokenization.
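
The three steps above can be sketched in base R on a toy example. This is only an illustration of the idea, not how the tm package implements it; the two documents are invented.

```r
# two hypothetical documents
docs <- c("The cat sat.", "The dog sat, the dog ran!")

# clean: homogenize case, drop punctuation
clean <- gsub("[[:punct:]]", "", tolower(docs))

# separate the words
tokens <- strsplit(clean, "\\s+")

# count each term in each document: rows = documents, columns = terms
terms <- sort(unique(unlist(tokens)))
counts <- lapply(tokens, function(tk) as.integer(table(factor(tk, levels=terms))))
dtm <- do.call(rbind, counts)
colnames(dtm) <- terms
dtm
```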

Packages for handling text data

  • tm = text mining
  • wordcloud
  • RTextTools

Example: SMS spam data

The following is a study of SMS messages used to train a spam filter. The data has been prepared as a .csv file with columns for type (“spam” or “ham”) and the text of the message.

setwd("/Users/steve/Documents/Computing with Data/19_strings_and_text")
sms_spam_df <- read.csv(file="./data/sms_spam.csv", stringsAsFactors=FALSE)
str(sms_spam_df)
'data.frame':   5559 obs. of  2 variables:
 $ type: chr  "ham" "ham" "ham" "spam" ...
 $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...

Prepare the Corpus

A corpus is a collection of documents.

library(tm)  # provides Corpus, VectorSource, tm_map
sms_corpus <- Corpus(VectorSource(sms_spam_df$text))
A corpus with 5559 text documents

The Corpus function is very flexible. It can read PDFs and Word documents, for example. Here, VectorSource tells Corpus that each entry in the vector is a document.

Inspecting the corpus

inspect(sms_corpus[1:3])
A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:

Hope you are having a good week. Just checking in

K..give back my thanks.

Am also doing in cbe only. But have to pay.

Identify real words

Different text may contain “Hello!”, “Hello,” “hello…”, etc. We would consider all of these the same.

We clean up the corpus with the tm_map function.

Clean corpus, part 1

# translate all letters to lower case
# (with tm 0.6 and later, use tm_map(sms_corpus, content_transformer(tolower)))
clean_corpus <- tm_map(sms_corpus, tolower)
# remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# remove punctuation
clean_corpus <- tm_map(clean_corpus, removePunctuation)

Clean corpus, part 2

Common practice: remove common non-content words, like to, and, the, … These are called stop words. The function stopwords returns a list of about 175 such words.

 [1] "i"         "me"        "my"        "myself"    "we"       
 [6] "our"       "ours"      "ourselves" "you"       "your"     
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())

Clean corpus, part 3

Finally, remove the excess white space.

clean_corpus <- tm_map(clean_corpus, stripWhitespace)

Inspect the clean corpus

inspect(clean_corpus[1:3])
A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:

hope good week just checking 

kgive back thanks

 also cbe pay
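
The entry “kgive” above is a side effect of removing punctuation without inserting a space. A base R sketch that imitates the cleaning steps on the second message; the gsub patterns and the three-word stop list are stand-ins chosen for illustration, not the tm implementation:

```r
msg <- "K..give back my thanks."

msg <- tolower(msg)                   # "k..give back my thanks."
msg <- gsub("[0-9]+", "", msg)        # no digits in this message
msg <- gsub("[[:punct:]]+", "", msg)  # "kgive back my thanks"

# hypothetical, very short stop word list
stop_words <- c("my", "the", "to")
pattern <- paste0("\\b(", paste(stop_words, collapse="|"), ")\\b")
msg <- gsub(pattern, "", msg, perl=TRUE)  # leaves a double space

msg <- gsub("\\s+", " ", msg)         # collapse excess white space
msg                                   # "kgive back thanks"
```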

Tokenize the corpus

A token is a single element in a text string, in most cases a word.

sms_dtm <- DocumentTermMatrix(clean_corpus)
A document-term matrix (5559 documents, 8210 terms)

Non-/sparse entries: 43726/45595664
Sparsity           : 100%
Maximal term length: 40 
Weighting          : term frequency (tf)

Spam or Ham

From here you can use a naive Bayes classifier to build a spam filter based on the words in the message.

Just to visualize the differences, we create word clouds for the two groups.

Indices of spam and ham

The indices of the two message types in the original data carry over to the corpus and doc term matrix.

spam_indices <- which(sms_spam_df$type == "spam")
[1] 4 5 9
ham_indices <- which(sms_spam_df$type == "ham")
[1] 1 2 3
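
How the indices are used, sketched on a small stand-in for sms_spam_df. The data frame toy_df and its messages are invented for illustration.

```r
# hypothetical stand-in with the same two columns as sms_spam_df
toy_df <- data.frame(
  type=c("ham", "spam", "ham", "spam"),
  text=c("see you at 5", "WIN cash now", "ok thanks", "free prize"),
  stringsAsFactors=FALSE
)

spam_idx <- which(toy_df$type == "spam")  # 2 4
ham_idx <- which(toy_df$type == "ham")    # 1 3

# the same indices select the matching messages
toy_df$text[spam_idx]                     # "WIN cash now" "free prize"
table(toy_df$type)                        # counts per message type
```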

Word cloud for ham

library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)

[Figure: word cloud of the ham messages]

Word cloud for spam

wordcloud(clean_corpus[spam_indices], min.freq=40)

[Figure: word cloud of the spam messages]