Vectors, factors and lists

This is all material that I expect you to know, but I want to emphasize some features that other writers may not. Because it's a review, the treatment will be brief.

Atomic vectors

Atomic vectors are the most basic objects in R. The modes you will meet most often are

  1. numeric
  2. character
  3. logical

R makes few distinctions between a number and a sequence of numbers.
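
One way to see this: a single number is already a vector of length 1, so the same functions and indexing apply to it. A minimal sketch (the name x is only for illustration):

```r
x <- 3.7              # a "single number"
is.vector(x)
## [1] TRUE
length(x)
## [1] 1
x[1]                  # indexing works as for any vector
## [1] 3.7
identical(x, c(3.7))  # c() around one value changes nothing
## [1] TRUE
```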

Numeric vectors

A numeric vector would arise when storing readings about some set of samples.

names(v) v
S1 1.6
S2 2.1
S3 1.5
S4 1.9

The names attribute must be a character vector.

v <- c(1.6, 2.1, 1.5, 1.9)
names(v) <- c("S1", "S2", "S3", "S4")
v
##  S1  S2  S3  S4 
## 1.6 2.1 1.5 1.9
length(v)
## [1] 4

The names attribute is optional.

names(v) <- NULL
v
## [1] 1.6 2.1 1.5 1.9

An integer vector is a special type of numeric vector. It's easy to form ranges of integers like

j <- 2:6
j
## [1] 2 3 4 5 6

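You can fetch the mode of a vector with mode. The related function class often gives the same answer, but not always: for the integer vector j it returns "integer".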

mode(v)
## [1] "numeric"
class(v)
## [1] "numeric"
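
The two functions can disagree, which is worth seeing once. A small sketch using the integer range from above (typeof is an additional base R function not used elsewhere in this section; it reports the internal storage type):

```r
j <- 2:6
mode(j)
## [1] "numeric"
class(j)
## [1] "integer"
typeof(j)
## [1] "integer"
typeof(c(1.6, 2.1))   # ordinary numeric values are stored as doubles
## [1] "double"
```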

Character vectors

Example:

s <- c("a", "BBB", "TCGA", "1", "2.1")
length(s)
## [1] 5
length("BBB")
## [1] 1
nchar("BBB")
## [1] 3
mode(c(3, 2))
## [1] "numeric"
mode(c("3", "2"))
## [1] "character"
x2 <- as.numeric(c("3", "2"))
x2
## [1] 3 2
mode(x2)
## [1] "numeric"
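
Conversion between modes also happens implicitly, and it can fail quietly. The sketch below assumes two made-up vectors, mixed and x4: combining different types with c coerces everything to character, and strings that are not numbers become NA under as.numeric (R also issues a warning, suppressed here so the example runs cleanly).

```r
mixed <- c(1, "a", TRUE)   # mixing types coerces everything to character
mixed
## [1] "1"    "a"    "TRUE"
mode(mixed)
## [1] "character"
# "TCGA" is not a number, so it converts to NA
x4 <- suppressWarnings(as.numeric(c("3", "2.1", "TCGA")))
x4
## [1] 3.0 2.1  NA
```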

Logical vectors

A logical vector is just a sequence of TRUE and FALSE values.

lv <- c(TRUE, FALSE, TRUE)

Trick: sum can count the number of TRUE values in a logical vector.

sum(lv)
## [1] 2

Like other vectors, logical vectors can carry a names attribute, respond to the length command, etc.
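
A short sketch of these points, using a hypothetical named logical vector passed. Since TRUE counts as 1 and FALSE as 0, mean gives the proportion of TRUE values in the same way that sum gives the count.

```r
passed <- c(S1 = TRUE, S2 = FALSE, S3 = TRUE, S4 = TRUE)
length(passed)
## [1] 4
names(passed)[passed]   # names of the TRUE entries
## [1] "S1" "S3" "S4"
mean(passed)            # proportion of TRUE values
## [1] 0.75
```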

Comparisons

Logical vectors arise from comparisons, normally ==, !=, <, >, %in%, etc. These comparisons are applied element by element across vectors.

1 < 2
## [1] TRUE
c(0, 1, 3) < 2
## [1]  TRUE  TRUE FALSE
c(0, 1, 3) == 1
## [1] FALSE  TRUE FALSE
c(0, 1, 3) != 1
## [1]  TRUE FALSE  TRUE
0 %in% c(0, 1, 3)
## [1] TRUE
c(0, 2) %in% c(0, 1, 3)
## [1]  TRUE FALSE

Comparison expressions can be strung into compound formulas using & (and) and | (or).
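
As a sketch of compound comparisons, with a made-up vector u:

```r
u <- c(0, 1, 3, 5, 8)
u > 0 & u < 5        # both conditions must hold
## [1] FALSE  TRUE  TRUE FALSE FALSE
u < 1 | u > 4        # at least one condition must hold
## [1]  TRUE FALSE FALSE  TRUE  TRUE
!(u > 0 & u < 5)     # ! negates element by element
## [1]  TRUE FALSE FALSE  TRUE  TRUE
```

The doubled forms && and || are not element-wise. They are intended for single values, as in if() conditions.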

Restricting and subsetting vectors

Positive integer vectors select elements from a vector by index. Negative integers remove the corresponding elements.

v <- sample(1:100, size = 25)
v[15:19]
## [1] 23 67 18 30 56
v[25]
## [1] 98
v[-(3:25)]
## [1] 60 68

You can also select using the names attribute.

y <- rnorm(10)
names(y) <- paste("S", 1:10, sep = "")
y
##       S1       S2       S3       S4       S5       S6       S7       S8 
##  1.22968  0.48125  0.44532  0.43198  1.24532  1.28221 -0.21674  0.32227 
##       S9      S10 
## -1.31833 -0.04364
y["S1"]
##   S1 
## 1.23
y[c("S2", "S4")]
##     S2     S4 
## 0.4813 0.4320

Subsetting with a logical vector

Putting a logical vector of the same length inside the [ ] forms a subvector consisting of those entries where the logical vector is TRUE.

a <- c("w", "XX", "z", "fractal")
ll <- c(TRUE, FALSE, FALSE, TRUE)
a[ll]
## [1] "w"       "fractal"
length(a[ll])
## [1] 2

This allows us to use comparison relations to select sequences.

# First write an expression to select 'XX' from a
a == "XX"
## [1] FALSE  TRUE FALSE FALSE
# Use this to subset a
a[a == "XX"]
## [1] "XX"

It's more interesting with numerical vectors.
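
For instance, a comparison can pick out the readings in a range, or count how many exceed a cutoff. A small sketch with made-up readings (temps is a hypothetical vector, not part of the sample data):

```r
temps <- c(12.5, 18.0, 21.3, 9.8, 25.1, 17.4)   # made-up readings
temps[temps > 15 & temps < 22]   # readings strictly between 15 and 22
## [1] 18.0 21.3 17.4
sum(temps > 20)                  # how many readings exceed 20
## [1] 2
```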

Example 1

Go to the sample data web page and run the first source command. This will load a vector samp_vec1 into your workspace.

  1. What are the length and mode of the vector?
  2. Define a subvector b consisting of the elements of samp_vec1 greater than its median.

Example 2

The above also loaded a vector samp_vec2.

  1. Compute the sum of samp_vec2.

What happened?

Missing data

Missing values, which are recorded as NA, can complicate applying functions to a vector and making comparisons with it.

z1 <- c(1, 1, 2, 4, 5, NA, NA)
z1 == NA
## [1] NA NA NA NA NA NA NA
# R has a special function for identifying missing values
is.na(z1)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
!is.na(z1)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
sum(is.na(z1))
## [1] 2

Missing values complicate comparisons and subsetting too.

z1 == 1
## [1]  TRUE  TRUE FALSE FALSE FALSE    NA    NA
z1[z1 == 1]
## [1]  1  1 NA NA
z1[z1 == 1 & !is.na(z1)]
## [1] 1 1

So z1[z1 == 1] gave me 2 more elements than I wanted, because the missing values could be 1.

How do I get only the 1's? Add the condition !is.na(z1), as in the last command above.

na.rm option

Let's return to samp_vec2. How do we find the sum of the values that aren't missing? We could subset and then sum the resulting vector, but R has a quicker solution. Look up the help for sum and solve the problem.

Assignment to subvectors

It may be necessary to assign new values to entries in a vector, either to individual entries or to whole subvectors. Such assignments are similar to subsetting, but the square bracket is on the left side of a <- .

To assign to individual entries:

y
##       S1       S2       S3       S4       S5       S6       S7       S8 
##  1.22968  0.48125  0.44532  0.43198  1.24532  1.28221 -0.21674  0.32227 
##       S9      S10 
## -1.31833 -0.04364
y[2] <- 5
y["S10"] <- 0
y
##      S1      S2      S3      S4      S5      S6      S7      S8      S9 
##  1.2297  5.0000  0.4453  0.4320  1.2453  1.2822 -0.2167  0.3223 -1.3183 
##     S10 
##  0.0000

You can assign a single value or sequence to a range of entries.

y[1:3] <- 1
y
##      S1      S2      S3      S4      S5      S6      S7      S8      S9 
##  1.0000  1.0000  1.0000  0.4320  1.2453  1.2822 -0.2167  0.3223 -1.3183 
##     S10 
##  0.0000
y[1:3] <- c(0, -1, 2)
y
##      S1      S2      S3      S4      S5      S6      S7      S8      S9 
##  0.0000 -1.0000  2.0000  0.4320  1.2453  1.2822 -0.2167  0.3223 -1.3183 
##     S10 
##  0.0000

A logical vector can also be used to select the subsequence that takes the assignment. Predictably, the TRUE values of the vector within the [ ] define the subsequence.

x3 <- c("a", "b", "Xi", "mu")
sel <- c(FALSE, FALSE, TRUE, TRUE)
x3[sel] <- c("x", "m")
x3
## [1] "a" "b" "x" "m"

As with subsetting, the logical vector usually comes from a comparison.

w <- rt(6, df = 4)  # get 6 random values from a t-distribution with 4 degrees of freedom
w
## [1] -0.02800  0.04585 -1.66150 -0.90038 -1.73588 -1.60305
w[w > median(w)] <- 20
w
## [1] 20.000 20.000 -1.661 20.000 -1.736 -1.603

Factors

A factor looks like a character vector, but its possible values (the levels) are set by the context. For a factor describing gender the two levels might be "M" and "F". Internally R stores integer codes that point into the levels. The extra structure offered by the factor class helps in some statistical modeling and data management.
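
A minimal sketch of creating and inspecting a factor. The vector g and its values are made up for illustration:

```r
g <- factor(c("M", "F", "F", "M", "F"), levels = c("F", "M"))
g
## [1] M F F M F
## Levels: F M
levels(g)        # the allowed values
## [1] "F" "M"
as.integer(g)    # the underlying integer codes (positions in levels)
## [1] 2 1 1 2 1
table(g)         # counts per level
## g
## F M 
## 3 2
```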

Lists

Much of your time in large scale projects will be spent managing lists and data.frames (which are also lists). A list is an indexed collection of R objects. What an index points to is called a component of the list. It's like a vector, but you can put anything in a component. Like a vector, a list can have names for its components.

aa <- list(c(1, 2), c("z", "ALN"), c(1, 2, 3.4, 5.6))
class(aa)
## [1] "list"
length(aa)
## [1] 3
names(aa)
## NULL
names(aa) <- c("V1", "V2", "V3")
# A quicker way to initialize with names
bb <- list(V1 = c("a", "1"), V2 = c(2, 2, 2))
bb
## $V1
## [1] "a" "1"
## 
## $V2
## [1] 2 2 2

Indexing a list

Lists involve a lot of nested indexing. The number of brackets keeps it clear which level you are working at.

aa[[1]]
## [1] 1 2
bb$V1
## [1] "a" "1"

This gives the first component, as an object of its own class. The list structure is gone.

aa[1]
## $V1
## [1] 1 2
class(aa[1])
## [1] "list"

This is a list of length 1.

To get sublists in general, do what you'd do for a vector.

aa[2:3]
## $V2
## [1] "z"   "ALN"
## 
## $V3
## [1] 1.0 2.0 3.4 5.6

A double bracket with a range does not return a sublist, and aa[[2:3]] gives an error. Use single brackets for sublists.
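
The reason is that a vector inside [[ ]] is applied recursively, one level at a time. The sketch below uses cc, a hypothetical copy of aa, and wraps the failing call in tryCatch only so that the example runs without stopping:

```r
cc <- list(V1 = c(1, 2), V2 = c("z", "ALN"), V3 = c(1, 2, 3.4, 5.6))
cc[[3]][2]       # second element of the third component
## [1] 2
cc[[c(3, 2)]]    # same thing: [[ ]] with a vector indexes recursively
## [1] 2
# cc[[2:3]] asks for the third element of cc[[2]], which has only two
res <- tryCatch(cc[[2:3]], error = function(e) "error")
res
## [1] "error"
```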

Look ahead

How would we get the first indexed slot of every component in a list? If the list components are all numeric vectors, how do we get the means of all of them? This is what lapply and its relatives are for.
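
As a preview only, here is a minimal sketch of both tasks on a made-up list nums. lapply returns a list, while its relative sapply simplifies the result to a vector when it can.

```r
nums <- list(A = c(1, 2, 3), B = c(10, 20), C = c(4.5, 5.5, 6.5, 7.5))
sapply(nums, mean)                        # mean of each component
##  A  B  C 
##  2 15  6
lapply(nums, function(comp) comp[1])      # first slot of each component
## $A
## [1] 1
## 
## $B
## [1] 10
## 
## $C
## [1] 4.5
```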