See the Work Along - Functions on matrices and lists for in-class exercises.

setwd("~/Documents/Computing with Data/9_Functions_matrices_lists/")

Applying functions to rows and columns of matrices

In practical data analysis, a common step is to apply a function to each row of a matrix. We can use a for loop to do this, but it is such a common task that R offers a better way.

Set up a random integer 4 x 3 matrix.

set.seed(123)
v <- sample(x = -10:10, size = 12)
M1 <- matrix(v, nrow=4)
M1
##      [,1] [,2] [,3]
## [1,]   -4    7    4
## [2,]    5  -10   -5
## [3,]   -3    8    0
## [4,]    9    2   -6

Problem Compute the mean values of each row.

First, use a loop.

row_means <- rep(NA, times = nrow(M1))
for(j in 1:nrow(M1)) {
  row_means[j] <- mean(M1[j,])
}
row_means
## [1]  2.333 -3.333  1.667  1.667

The second solution uses the new function, apply. apply has 3 arguments: the first is a matrix (or data.frame), the second is 1 or 2, the third is a function. If the second argument is 1, apply applies the function to each row of the matrix. If it is 2, apply the function to each column. apply then returns the vector of outputs.

r_means <- apply(M1, 1, mean)
r_means
## [1]  2.333 -3.333  1.667  1.667

This is very similar to vectorizing a numeric function over a numeric vector. apply says to vectorize a numeric function over the rows of a matrix.

More natural example

For this example, we consider a large matrix arising from cancer biology:

load("./Data/expr_mat")
dim(expr_mat)
## [1] 22283  1269
expr_mat[1:5, 1:6]
##           GSM79114 GSM79118 GSM79119 GSM79120 GSM79121 GSM79123
## 1007_s_at    9.990    9.536   10.189    9.682   10.604    9.057
## 1053_at      4.241    5.957    5.579    5.334    5.036    5.910
## 117_at       3.406    3.615    3.573    3.741    3.755    3.458
## 121_at       5.689    5.476    4.833    4.837    4.833    4.838
## 1255_g_at    2.224    2.224    2.224    2.224    2.224    2.224
object.size(expr_mat)
## 227725328 bytes

Each row corresponds to a gene and each column a cancer tumor sample. The number in a particular cell measures the concentration of the gene’s output in the relevant sample.

A natural question is “Which genes vary the most over all of the samples?”

We’ll use variance, although interquartile range (IQR) is another common option.

row_vars <- apply(expr_mat, 1, var)

Here is the loop method.

r_vars <- numeric(nrow(expr_mat))
  for(j in 1:nrow(expr_mat)) {
    r_vars[j] <- var(expr_mat[j,])
  }

Execution times are very similar. If we have 50 million rows, then apply will offer a speed-up. The big advantage in everyday life is fewer lines of code to write.

Apply can also output a matrix of values:

sum_stats <- apply(expr_mat, 1, summary)
dim(sum_stats)
## [1]     6 22283
sum_stats[ , 1:5]
##         1007_s_at 1053_at 117_at 121_at 1255_g_at
## Min.         7.06    4.05   2.69   3.66      2.22
## 1st Qu.      9.06    5.54   3.46   4.83      2.22
## Median       9.49    5.91   3.61   4.83      2.22
## Mean         9.47    5.92   3.78   4.94      2.22
## 3rd Qu.      9.89    6.24   3.90   5.02      2.22
## Max.        11.50    8.90  10.20   7.66      2.28

More complex function

It’s very common to need a custom function with apply. The third slot in apply must be a function that takes a row (or column) as input.

Problem For each sample, identify a gene with maximal expression in that sample, and return the gene id.

First define the function that does the work.

sel_max <- function(v) {
  names(v) <- rownames(expr_mat)
  v2 <- sort(v, decreasing=T)
  w <- names(v2)[1]
  w
}
maximas <- apply(expr_mat, 2, sel_max)
maximas[1:3]
##      GSM79114      GSM79118      GSM79119 
## "202310_s_at" "213867_x_at" "213867_x_at"

Frequently, we just put the definition of the function directly within the apply.

Applying a function to each component of a list

There is a corresponding function that enables the application of a function to each component of a list.