March 20, 2014
a <- "the first string"
length(a)
[1] 1
length gives the number of elements of the vector, not the number of characters. nchar gives the number of characters.
nchar(a)
[1] 16
Longer text may contain line feeds, tabs, and other non-visible characters.
b <- "this is a
multiline sentence"
b
[1] "this is a \nmultiline sentence"
R codes the line feed as \n.
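Printing shows the escape sequence; cat renders the actual line feed:
cat(b)
this is a 
multiline sentence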
substr lets you extract a substring by index.
substr("This is a string", start=5, stop=8)
[1] " is "
x <- c("S", "T")
y <- 1:4
paste(x,y)
[1] "S 1" "T 2" "S 3" "T 4"
paste(x, y, sep=":")
[1] "S:1" "T:2" "S:3" "T:4"
paste combines vectors element-wise, recycling the shorter one. With collapse, it joins the result into a single string.
a <- paste(x, y, sep=":")
length(a)
[1] 4
(b <- paste(a, collapse=","))
[1] "S:1,T:2,S:3,T:4"
length(b)
[1] 1
Exercise: create a single string containing the “words” S1, S2, S3, S4, with a space between each.
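One possible solution uses sep and collapse in a single paste call:
paste("S", 1:4, sep="", collapse=" ")
[1] "S1 S2 S3 S4"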
regexpr is the most basic pattern-matching function; it reports only the first match.
a <- "This is a sentence."
regexpr(pattern="i", a)
[1] 3
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
gregexpr finds all of the matches.
gregexpr(pattern="i", a)
[[1]]
[1] 3 6
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
With a vector input, you get one list element per string.
gregexpr(pattern="i", c(a,a))
[[1]]
[1] 3 6
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 3 6
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
Find the locations of the digits in
p <- "1nce we went 2 a 3-ring circus"
gregexpr(pattern="[0-9]", p)
[[1]]
[1] 1 14 18
attr(,"match.length")
[1] 1 1 1
attr(,"useBytes")
[1] TRUE
Example of a regular expression
p
[1] "1nce we went 2 a 3-ring circus"
Match “we”.
gregexpr(pattern="we", p)[[1]]
[1] 6 9
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
gregexpr(pattern="[we]", p)[[1]]
[1] 4 6 7 9 10
attr(,"match.length")
[1] 1 1 1 1 1
attr(,"useBytes")
[1] TRUE
This matched occurrences of “w” or “e”.
Match digits followed by white space. In an R string the backslash must itself be escaped, so the regex \s is written "\\s".
# "[0-9]\s"
# Error: '\s' is an unrecognized escape in character string starting ""[0-9]\s"
"[0-9]\\s"
[1] "[0-9]\\s"
gregexpr(pattern="[0-9]\\s", p)
[[1]]
[1] 14
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
Get the location of a word starting with “c”, then take the remainder of p from there. (The pattern requires a preceding space, so it would skip a c-word at the very start of the string.)
loc <- gregexpr(pattern="\\sc", p)[[1]]
loc
[1] 24
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
substr(p, start=loc+1, stop=nchar(p))
[1] "circus"
Don't try to write these from memory; use the references, such as the regex help topic in R (?regex).
Count the words in
aa <- "There are 7 words in this sentence."
spaces <- gregexpr(pattern="\\s", aa)[[1]]
length(spaces) + 1  # one more word than separators
[1] 7
# Alternatively, match the words themselves
wrds <- gregexpr(pattern="\\w+", aa)[[1]]
wrds
[1] 1 7 11 13 19 22 27
attr(,"match.length")
[1] 5 3 1 5 2 4 8
attr(,"useBytes")
[1] TRUE
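To extract the matched words rather than their positions, base R's regmatches pairs the string with the match data:
regmatches(aa, gregexpr(pattern="\\w+", aa))
[[1]]
[1] "There"    "are"      "7"        "words"    "in"       "this"     "sentence"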
gsub identifies a pattern and replaces every match with another string.
gsub(pattern="\\s+", replacement="", x=" words ")
[1] "words"
Other useful functions include toupper and tolower, which convert a string's case.
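For example:
toupper("spam or ham")
[1] "SPAM OR HAM"
tolower("SMS Text")
[1] "sms text"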
s <- c("This", "vector", "is", "his")
Find “his”.
"his" %in% s
[1] TRUE
# It is there
s == "his"
[1] FALSE FALSE FALSE TRUE
# Only one true
which(s == "his")
[1] 4
# The fourth entry
How do we find all entries containing a “t”? This is what grep does.
grep(pattern="[Tt]", s)
[1] 1 2
You can get the entries themselves instead of the indexes with value=TRUE.
grep(pattern="[Tt]", s, value=TRUE)
[1] "This" "vector"
Find the entries that start with a capital letter.
grep(pattern="^[A-Z]", s, value=TRUE)
[1] "This"
grep matches whole entries; it doesn't tell you where inside an entry the match occurs. For that, use gregexpr and deal with the list output.
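For instance, the “t” in "vector" sits at position 4:
gregexpr(pattern="[Tt]", s)[[2]]
[1] 4
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE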
Extract the entries containing a period (.). Note this is a metacharacter that must be escaped.
bb <- c("Once", "the", "3rd", "**alarm**", "sounds", "...")
grep("\\.", bb, value=T)
[1] "..."
strsplit splits a string or vector by a regular expression. For a single string, it returns a character vector of the substrings separated by the pattern.
cc <- "This is
a
string"
lines <- strsplit(x=cc, split="\\n")
lines
[[1]]
[1] "This is " "a " " string"
Note: You always get a list returned even if the input is just a vector of length 1.
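If you want a plain character vector, flatten the list with unlist:
unlist(lines)
[1] "This is " "a " " string"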
records <- "sample: S15a, height: 189, weight: 89, date: 3/1/14
sample: S11, height: 185, weight: 92, date: 2/4/14
sample: S16b, height: 175, weight: 70, date: 3/1/14
"
There are 3 records separated by new lines.
records_sep <- strsplit(x=records, "\\n")
records_sep
[[1]]
[1] "sample: S15a, height: 189, weight: 89, date: 3/1/14"
[2] "sample: S11, height: 185, weight: 92, date: 2/4/14"
[3] "sample: S16b, height: 175, weight: 70, date: 3/1/14"
records_sep <- records_sep[[1]]
records1_L <- strsplit(records_sep, ",\\s?")
length(records1_L)
[1] 3
records1_L[[1]]
[1] "sample: S15a" "height: 189" "weight: 89" "date: 3/1/14"
records_LL <- lapply(records1_L, function(r) {
strsplit(r, ":\\s?")
})
records_LL[[1]]
[[1]]
[1] "sample" "S15a"
[[2]]
[1] "height" "189"
[[3]]
[1] "weight" "89"
[[4]]
[1] "date" "3/1/14"
Extract the values (the second element of each pair) from one record.
sapply(records_LL[[1]], function(x) x[2])
[1] "S15a" "189" "89" "3/1/14"
# now extract the values from every record
records_values_L <- lapply(records_LL, function(r_L) {
sapply(r_L, function(x) x[2])
})
records_values_L
[[1]]
[1] "S15a" "189" "89" "3/1/14"
[[2]]
[1] "S11" "185" "92" "2/4/14"
[[3]]
[1] "S16b" "175" "70" "3/1/14"
Finally, ldply from the plyr package stacks the list of value vectors into a data frame.
library(plyr)
dat <- ldply(records_values_L, function(ls) ls)
dat
    V1  V2 V3     V4
1 S15a 189 89 3/1/14
2  S11 185 92 2/4/14
3 S16b 175 70 3/1/14
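A base-R alternative (a sketch, without plyr) binds the vectors row-wise and converts to a data frame:
dat2 <- as.data.frame(do.call(rbind, records_values_L), stringsAsFactors=FALSE)
dat2
    V1  V2 V3     V4
1 S15a 189 89 3/1/14
2  S11 185 92 2/4/14
3 S16b 175 70 3/1/14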
Following is a study of SMS records used to train a spam filter. The data has been prepared as a .csv file with columns for the type (“spam” or “ham”) and the text of the message.
setwd("/Users/steve/Documents/Computing with Data/19_strings_and_text")
sms_spam_df <- read.csv(file="./data/sms_spam.csv", stringsAsFactors=F)
str(sms_spam_df)
'data.frame': 5559 obs. of 2 variables:
$ type: chr "ham" "ham" "ham" "spam" ...
$ text: chr "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...
A corpus is a collection of documents.
library(tm)
sms_corpus <- Corpus(VectorSource(sms_spam_df$text))
print(sms_corpus)
A corpus with 5559 text documents
The Corpus function is very flexible; it can read PDFs, Word documents, etc. Here, VectorSource told Corpus that each document is an entry in the vector.
inspect(sms_corpus[1:3])
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Hope you are having a good week. Just checking in
[[2]]
K..give back my thanks.
[[3]]
Am also doing in cbe only. But have to pay.
Different texts may contain “Hello!”, “Hello,”, “hello…”, etc. We would consider all of these the same word.
We clean up the corpus with the tm_map function.
#translate all letters to lower case
clean_corpus <- tm_map(sms_corpus, tolower)
# remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# remove punctuation
clean_corpus <- tm_map(clean_corpus, removePunctuation)
Common practice: remove frequent non-content words, like to, and, the, etc. These are called stop words in the lingo. The function stopwords returns a list of about 175 such words.
stopwords()[1:10]
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())
Finally, remove the excess white space.
clean_corpus <- tm_map(clean_corpus, stripWhitespace)
inspect(clean_corpus[1:3])
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
hope good week just checking
[[2]]
kgive back thanks
[[3]]
also cbe pay
A token is a single element of a text string, in most cases a word. A document-term matrix has a row per document and a column per token, recording how often each token occurs in each document.
sms_dtm <- DocumentTermMatrix(clean_corpus)
sms_dtm
A document-term matrix (5559 documents, 8210 terms)
Non-/sparse entries: 43726/45595664
Sparsity : 100%
Maximal term length: 40
Weighting : term frequency (tf)
From here you can use a naive Bayes classifier to build a spam filter based on the words in the message.
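A minimal sketch of that step, assuming the e1071 package is available; the convert_counts helper is illustrative, not from this document:
library(e1071)
# naiveBayes handles categorical features well, so recode each
# term count as a "Yes"/"No" presence indicator
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
sms_features <- as.data.frame(apply(as.matrix(sms_dtm), 2, convert_counts),
                              stringsAsFactors=TRUE)
# in practice, first drop rare terms (e.g. with tm's removeSparseTerms)
# and split the rows into training and test sets
sms_classifier <- naiveBayes(sms_features, factor(sms_spam_df$type))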
Just to visualize the differences, we create word clouds for the two groups.
The indices of the two message types in the original data carry over to the corpus and doc term matrix.
spam_indices <- which(sms_spam_df$type == "spam")
spam_indices[1:3]
[1] 4 5 9
ham_indices <- which(sms_spam_df$type == "ham")
ham_indices[1:3]
[1] 1 2 3
library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)   # plot only words appearing at least 40 times
wordcloud(clean_corpus[spam_indices], min.freq=40)