This is all material that I expect you to know, but I'll want to emphasize some features that other writers may not. This will be very sketchy because it's a review.
Atomic vectors are the most basic objects in R. The possible modes include numeric, character, and logical.
R makes few distinctions between a number and a sequence of numbers.
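A minimal sketch of what that means in practice: a single number is just a numeric vector of length 1, and the same arithmetic works on one number or many. The variable name x is only for illustration.

```r
# A single number is stored as a vector of length 1
x <- 5
is.vector(x)
## [1] TRUE
length(x)
## [1] 1

# The same operation applies to a single number or to a sequence
x * 2
## [1] 10
c(1, 2, 3) * 2
## [1] 2 4 6
```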
A numeric vector would arise when storing readings about some set of samples.
| names(v) | v |
|---|---|
| S1 | 1.6 |
| S2 | 2.1 |
| S3 | 1.5 |
| S4 | 1.9 |
The names attribute must be a character vector.
v <- c(1.6, 2.1, 1.5, 1.9)
names(v) <- c("S1", "S2", "S3", "S4")
v
## S1 S2 S3 S4
## 1.6 2.1 1.5 1.9
length(v)
## [1] 4
The names attribute is optional.
names(v) <- NULL
v
## [1] 1.6 2.1 1.5 1.9
An integer vector is a special type of numeric vector. It's easy to form ranges of integers like
j <- 2:6
j
## [1] 2 3 4 5 6
You can fetch the mode of a vector with mode or its class with class.
mode(v)
## [1] "numeric"
class(v)
## [1] "numeric"
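For the vector v above the two answers agree, but they differ for an integer vector, which is where the "special type of numeric vector" remark shows up. A short sketch, reusing the range j from earlier:

```r
j <- 2:6
# The mode is still numeric ...
mode(j)
## [1] "numeric"
# ... but the class records that the values are stored as integers
class(j)
## [1] "integer"
is.numeric(j)
## [1] TRUE
# Numbers typed with c() are stored as doubles, so the class is "numeric"
class(c(2, 3))
## [1] "numeric"
```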
Example with a character vector:
s <- c("a", "BBB", "TCGA", "1", "2.1")
length(s)
## [1] 5
length("BBB")
## [1] 1
nchar("BBB")
## [1] 3
mode(c(3, 2))
## [1] "numeric"
mode(c("3", "2"))
## [1] "character"
x2 <- as.numeric(c("3", "2"))
x2
## [1] 3 2
mode(x2)
## [1] "numeric"
A logical vector is just a sequence of TRUE and FALSE instances.
lv <- c(TRUE, FALSE, TRUE)
Trick: sum can count the number of TRUE values in a logical vector.
sum(lv)
## [1] 2
Logical vectors can carry a names attribute, respond to the length command, etc.
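A quick sketch of both points; the sample names S1 to S3 are made up for illustration.

```r
lv <- c(TRUE, FALSE, TRUE)
names(lv) <- c("S1", "S2", "S3")
length(lv)
## [1] 3
# The names of the entries that are TRUE
names(lv)[lv]
## [1] "S1" "S3"
```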
Logical vectors arise from comparisons, normally ==, !=, <, >, %in%, etc. These comparisons are applied element by element across vectors.
1 < 2
## [1] TRUE
c(0, 1, 3) < 2
## [1] TRUE TRUE FALSE
c(0, 1, 3) == 1
## [1] FALSE TRUE FALSE
c(0, 1, 3) != 1
## [1] TRUE FALSE TRUE
0 %in% c(0, 1, 3)
## [1] TRUE
c(0, 2) %in% c(0, 1, 3)
## [1] TRUE FALSE
Comparison expressions can be strung into compound formulas using & (and) and | (or).
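A small sketch of compound comparisons on a made-up numeric vector x. Like the simple comparisons, & and | work element by element, and ! negates the result.

```r
x <- c(0, 1, 3, 5, 8)
# Both conditions must hold
x > 0 & x < 5
## [1] FALSE TRUE TRUE FALSE FALSE
# Either condition may hold
x < 1 | x > 4
## [1] TRUE FALSE FALSE TRUE TRUE
# Negation of a compound expression
!(x > 0 & x < 5)
## [1] TRUE FALSE FALSE TRUE TRUE
# Count how many entries satisfy the compound condition
sum(x > 0 & x < 5)
## [1] 2
```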
Positive integer vectors select elements from a vector by index. Negative integers remove the corresponding elements.
v <- sample(1:100, size = 25)
v[15:19]
## [1] 23 67 18 30 56
v[25]
## [1] 98
v[-(3:25)]
## [1] 60 68
You can also select using the names attribute.
y <- rnorm(10)
names(y) <- paste("S", 1:10, sep = "")
y
## S1 S2 S3 S4 S5 S6 S7 S8
## 1.22968 0.48125 0.44532 0.43198 1.24532 1.28221 -0.21674 0.32227
## S9 S10
## -1.31833 -0.04364
y["S1"]
## S1
## 1.23
y[c("S2", "S4")]
## S2 S4
## 0.4813 0.4320
Putting a logical vector of the same length inside the [ ] forms a subvector consisting of those entries where it's true.
a <- c("w", "XX", "z", "fractal")
ll <- c(TRUE, FALSE, FALSE, TRUE)
a[ll]
## [1] "w" "fractal"
length(a[ll])
## [1] 2
This allows us to use comparison relations to select sequences.
# First write an expression to select 'XX' from a
a == "XX"
## [1] FALSE TRUE FALSE FALSE
# Use this to subset a
a[a == "XX"]
## [1] "XX"
It's more interesting with numerical vectors.
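For instance, here is a sketch using the four readings from the start of the section, stored under the illustrative name h so that nothing in the workspace is overwritten:

```r
h <- c(1.6, 2.1, 1.5, 1.9)
names(h) <- c("S1", "S2", "S3", "S4")
# How many readings exceed 1.8?
sum(h > 1.8)
## [1] 2
# Which samples are they?
names(h)[h > 1.8]
## [1] "S2" "S4"
# The readings above the mean (1.775), with names dropped for display
unname(h[h > mean(h)])
## [1] 2.1 1.9
```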
Go to the sample data web page and run the first source command. This will load a vector samp_vec1 into your workspace.
The above also loaded a vector samp_vec2. Take the sum of samp_vec2. What happened?
Having missing values in a vector, which are recorded as NA, can complicate the application of functions and comparisons with the vector.
z1 <- c(1, 1, 2, 4, 5, NA, NA)
z1 == NA
## [1] NA NA NA NA NA NA NA
# R has a special function for identifying missing values
is.na(z1)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
!is.na(z1)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
sum(is.na(z1))
## [1] 2
Missing values complicate comparisons and subsetting too.
z1 == 1
## [1] TRUE TRUE FALSE FALSE FALSE NA NA
z1[z1 == 1]
## [1] 1 1 NA NA
z1[z1 == 1 & !is.na(z1)]
## [1] 1 1
The plain comparison z1[z1 == 1] returns 2 more elements than I wanted, because the missing values could be 1. To get only the 1's, also require !is.na(z1), as in the last command above.
Let's return to samp_vec2. How do we find the sum of the values that aren't missing? We could subset and then sum the resulting vector, but R has a quicker solution. Look up the help for sum and solve the problem.
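If you want to check your answer, here is a sketch on a small made-up vector (z2 is hypothetical, not one of the sample data vectors). It shows the subset-then-sum route next to the na.rm argument described in the help page for sum.

```r
z2 <- c(3, NA, 4, 5, NA)
# Any NA makes the plain sum missing
sum(z2)
## [1] NA
# Subset first, then sum
sum(z2[!is.na(z2)])
## [1] 12
# Or ask sum to drop the missing values itself
sum(z2, na.rm = TRUE)
## [1] 12
# mean accepts the same argument
mean(z2, na.rm = TRUE)
## [1] 4
```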
It may be necessary to assign new values to entries in a vector, either to individual entries or to whole subvectors. Such assignments are similar to subsetting, but the square bracket is on the left side of a <- .
To assign to individual entries:
y
## S1 S2 S3 S4 S5 S6 S7 S8
## 1.22968 0.48125 0.44532 0.43198 1.24532 1.28221 -0.21674 0.32227
## S9 S10
## -1.31833 -0.04364
y[2] <- 5
y["S10"] <- 0
y
## S1 S2 S3 S4 S5 S6 S7 S8 S9
## 1.2297 5.0000 0.4453 0.4320 1.2453 1.2822 -0.2167 0.3223 -1.3183
## S10
## 0.0000
You can assign a single value or sequence to a range of entries.
y[1:3] <- 1
y
## S1 S2 S3 S4 S5 S6 S7 S8 S9
## 1.0000 1.0000 1.0000 0.4320 1.2453 1.2822 -0.2167 0.3223 -1.3183
## S10
## 0.0000
y[1:3] <- c(0, -1, 2)
y
## S1 S2 S3 S4 S5 S6 S7 S8 S9
## 0.0000 -1.0000 2.0000 0.4320 1.2453 1.2822 -0.2167 0.3223 -1.3183
## S10
## 0.0000
A logical vector can also be used to select the subsequence that takes the assignment. Predictably, the TRUE values of the vector within the [ ] define the subsequence.
x3 <- c("a", "b", "Xi", "mu")
sel <- c(FALSE, FALSE, TRUE, TRUE)
x3[sel] <- c("x", "m")
x3
## [1] "a" "b" "x" "m"
As with subsetting, the logical vector usually comes from an equation or inequality.
w <- rt(6, df = 4) # get 6 random values from a t-distribution with 4 degrees of freedom
w
## [1] -0.02800 0.04585 -1.66150 -0.90038 -1.73588 -1.60305
w[w > median(w)] <- 20
w
## [1] 20.000 20.000 -1.661 20.000 -1.736 -1.603
A factor behaves like a character vector for which the possible values, called levels, are set by the context. For a factor describing gender the two possibilities are "M" and "F". The extra structure offered by the factor class helps in some statistical modeling and data management.
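A minimal sketch of the gender example, with made-up data stored under the illustrative name g:

```r
g <- factor(c("M", "F", "F", "M", "F"), levels = c("F", "M"))
class(g)
## [1] "factor"
# The allowed values
levels(g)
## [1] "F" "M"
nlevels(g)
## [1] 2
# Internally each entry is an integer code pointing at a level
as.integer(g)
## [1] 2 1 1 2 1
# Counts per level, in level order (F, then M)
as.vector(table(g))
## [1] 3 2
```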
Much of your time in large scale projects will be spent managing lists and data.frames (which are also lists). A list is an indexed collection of R objects. What an index points to is called a component of the list. It's like a vector, but you can put anything in a component. Like a vector, a list can carry names for the components.
aa <- list(c(1, 2), c("z", "ALN"), c(1, 2, 3.4, 5.6))
class(aa)
## [1] "list"
length(aa)
## [1] 3
names(aa)
## NULL
names(aa) <- c("V1", "V2", "V3")
# A quicker way to initialize with names
bb <- list(V1 = c("a", "1"), V2 = c(2, 2, 2))
bb
## $V1
## [1] "a" "1"
##
## $V2
## [1] 2 2 2
Lists involve a lot of nested indexing. The bracket level keeps it clear: double brackets [[ ]] (or $ with a name) extract a component, while single brackets [ ] return a sublist.
aa[[1]]
## [1] 1 2
bb$V1
## [1] "a" "1"
This gives the first component, as an object of its own class. The list structure is gone.
aa[1]
## $V1
## [1] 1 2
class(aa[1])
## [1] "list"
This is a list of length 1.
To get sublists in general, do what you'd do for a vector.
aa[2:3]
## $V2
## [1] "z" "ALN"
##
## $V3
## [1] 1.0 2.0 3.4 5.6
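A double bracket range such as aa[[2:3]] does not return a sublist. R reads it as nested indexing, aa[[2]][[3]], which gives an error here.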
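A sketch of what happens, rebuilding the list aa with its names so the block runs on its own. The error is caught with tryCatch so the script keeps going. With a list, a vector inside [[ ]] is treated as nested indexing, so the outcome depends on the components.

```r
aa <- list(V1 = c(1, 2), V2 = c("z", "ALN"), V3 = c(1, 2, 3.4, 5.6))
# aa[[2:3]] means aa[[2]][[3]], but aa[[2]] has only 2 entries
res <- tryCatch(aa[[2:3]], error = function(e) "error")
res
## [1] "error"
# The same rule can succeed: this is aa[[3]][[2]]
aa[[c(3, 2)]]
## [1] 2
# Single brackets are the way to get a sublist
length(aa[2:3])
## [1] 2
```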
How would we get the first indexed slot of every component in a list? If the list components are all numeric vectors, how do we get the means of all of them? This is what lapply and its relatives are for.
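A short sketch of both questions on a made-up list of numeric vectors (nums is an illustrative name). lapply always returns a list, while sapply simplifies the result to a vector when it can.

```r
nums <- list(V1 = c(1, 2), V2 = c(2, 2, 2), V3 = c(1, 2, 3.4, 5.6))
# First entry of every component, returned as a list
firsts <- lapply(nums, function(x) x[1])
class(firsts)
## [1] "list"
names(firsts)
## [1] "V1" "V2" "V3"
# The same thing simplified to a vector (names dropped for display)
unname(sapply(nums, function(x) x[1]))
## [1] 1 2 1
# Mean of every component
unname(sapply(nums, mean))
## [1] 1.5 2.0 3.0
```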