Usual knocks on **R**

- It's slow
- Objects have to be loaded into memory to analyze them.

Both of these problems limit its usefulness for handling big data and for situations where millions of calculations need to be done.

Libraries like `foreach`, `iterators`, and `ff`, together with packages that facilitate parallelization, are relieving these deficiencies.

The `for` loop

```
for (i in 1:5) {
  print(i^2)
}
```

```
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
```

```
i_squared <- integer(5)
for (i in 1:5) {
  i_squared[i] <- i^2
}
i_squared
```

```
## [1] 1 4 9 16 25
```

```
i_squared <- integer(1e+09)
system.time(for (i in 1:1e+09) {
  i_squared[i] <- i^2
})
```

```
## user system elapsed
## 2.834 2.737 6.005
```

Drawbacks of `for`:

- You have to initialize container variables;
- It can only iterate over simple structures (vectors and lists);
- It is sometimes slow.

`foreach`

The utility of `foreach` and its syntax are different from `for`.

```
library(foreach)
```

```
x <- foreach(i = 1:3) %do% sqrt(i)
x
```

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414
##
## [[3]]
## [1] 1.732
```

Differences:

1. `=` instead of `in` for the index variable;
2. `%do%` to run the expression that follows;
3. `foreach` returns the computed output of each iteration, by default in a list.

This works a little like `lapply`.

You can supply multiple indices, with the following result.

```
x <- foreach(a = 1:10, b = c(25, 40)) %do% {
  a + b
}
x
```

```
## [[1]]
## [1] 26
##
## [[2]]
## [1] 42
```

The indices are paired up, and iteration stops when the shortest one is exhausted.

Rewrite the following timed `for` loop with `foreach` and inspect the time needed.

```
i_squared <- integer(1e3)
system.time(
  for (i in 1:1e3) { i_squared[i] <- i^2 }
)
```
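One possible `foreach` rewrite of the loop above is sketched below; no container needs to be initialized, and exact timings will differ from machine to machine.

```r
library(foreach)

# Each iteration returns i^2; .combine = "c" concatenates the
# results into a numeric vector instead of a list.
system.time(
  i_squared <- foreach(i = 1:1e3, .combine = "c") %do% i^2
)
i_squared[1:5]  # 1 4 9 16 25
```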

This was faster with `foreach`, but it doesn't scale very well as the iterator length increases.

By default, output is returned as a list to accommodate any possible return value. However, the `.combine` option can tell `foreach` to return the outputs in another format.

```
x <- foreach(i = 1:5, .combine = "c") %do% log(i)
x
```

```
## [1] 0.0000 0.6931 1.0986 1.3863 1.6094
```

This returned a numeric vector, as `sapply` would. Each output is a single number, so combining with `c()` produces a numeric vector.

This is a flexible utility that can deal with vectors, too.

```
x <- foreach(i = 1:5, .combine = "cbind") %do% rnorm(3)
x
```

```
## result.1 result.2 result.3 result.4 result.5
## [1,] 0.8963 1.5228 0.9151 -0.2878 0.8880
## [2,] -0.5545 0.6060 1.1870 0.3002 -0.9591
## [3,] -0.3946 -0.2651 -0.9609 0.8679 -0.4606
```

```
y <- foreach(i = 1:5, .combine = "rbind") %do% rnorm(3)
y
```

```
## [,1] [,2] [,3]
## result.1 -1.16722 -0.9703 0.3899
## result.2 0.06181 -0.1538 0.5110
## result.3 -0.62578 1.4343 -0.9803
## result.4 0.66444 -0.8928 -0.1510
## result.5 1.60457 -0.2227 -1.7464
```

In this way we can create matrix output.

You can also do numeric processing on the outputs by supplying your own combining function.

```
myfun <- function(a, b) {
  c <- cbind(a, b)
  apply(c, 1, mean)
}
foreach(i = 1:5, .combine = "myfun") %do% rnorm(3)
```

```
## [1] 0.76257 -0.35699 0.01977
```

`iterators`

At the heart of any looping structure is a variable that iterates sequentially through a collection of values. At any point in the script it *remembers* what the current value is and can produce the next value to be processed. Besides being handy in uses of `foreach`, this proves very useful when processing huge amounts of data in manageable chunks.

Almost all objects in **R** that contain data are indexed, hence can be made into iterators.

```
library(iterators)
i1 <- iter(1:20)
nextElem(i1)
```

```
## [1] 1
```

```
nextElem(i1)
```

```
## [1] 2
```
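When an iterator runs out of values, `nextElem` signals an error whose message is `"StopIteration"`; a small sketch of catching it with a short iterator:

```r
library(iterators)

i2 <- iter(1:2)
nextElem(i2)  # 1
nextElem(i2)  # 2

# The third call raises the "StopIteration" error; capture its message.
res <- tryCatch(nextElem(i2), error = function(e) conditionMessage(e))
res  # "StopIteration"
```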

You can turn a matrix or data.frame into an iterator that iterates over rows or columns.

Consider the `state.x77` matrix from base **R**'s **datasets** package.

```
state.x77[1:3, 1:5]
```

```
## Population Income Illiteracy Life Exp Murder
## Alabama 3615 3624 2.1 69.05 15.1
## Alaska 365 6315 1.5 69.31 11.3
## Arizona 2212 4530 1.8 70.55 7.8
```

Each row contains statistics for a state.

```
istate <- iter(state.x77, by = "row")
nextElem(istate)
```

```
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
```

```
nextElem(istate)
```

```
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
```

`iter` can also be applied to a function. Each element of the iterator is then a new call to the function, often a random sampler.

```
ifun <- iter(function() sample(0:8, size = 9, replace = T))
nextElem(ifun)
```

```
## [1] 6 1 5 3 4 5 8 5 7
```

```
nextElem(ifun)
```

```
## [1] 7 6 4 1 3 2 4 0 5
```

The iterators package has some functions that make it easy to define common iterators.

```
a <- icount(20) # an iterator running from 1 to 20
a <- irnorm(2, count = 5) # 5 samplings of pairs from a normal distribution
```

A very useful one we'll see later iterates over lines in a file.
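That iterator is `ireadLines()`, which steps through the lines of a file or connection one at a time. A small sketch using a temporary file (the file and its contents are made up for illustration):

```r
library(iterators)

# Write a throwaway two-line file to iterate over.
tmp <- tempfile()
writeLines(c("first line", "second line"), tmp)

ifile <- ireadLines(tmp)   # by default yields one line per call
first <- nextElem(ifile)   # "first line"
second <- nextElem(ifile)  # "second line"

unlink(tmp)
```

Because only one line is held in memory at a time, this is how very large files can be processed in manageable chunks.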

You can combine this with `foreach` to write your own apply-style function.

```
mat <- matrix(rnorm(1e+06), ncol = 10000)
dim(mat)
```

```
## [1] 100 10000
```

```
itx <- iter(mat, by = "row")
row_means <- foreach(i = itx, .combine = "c") %do% mean(i)
row_means[1:10]
```

```
## [1] -0.0216767 0.0006555 0.0013509 0.0051333 0.0081804 -0.0022249
## [7] 0.0189809 -0.0100963 -0.0137118 -0.0059361
```

We can parallelize this to make it faster when the function applied to each row is compute-intensive.

In some situations we may want to apply a function to an iteration value only when a certain condition is met. `foreach` supports this through the `%:%` operator and `when()`. This can be very handy when lots of data is stored in one object but we're only interested in a small fraction of it.

For example,

```
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:%
  when(i >= 0) %do% sqrt(i)
length(x)
```

```
## [1] 11
```

```
x
```

```
## [1] 1.0770 1.1556 1.0300 1.3946 0.6176 0.3800 0.5537 0.7717 0.7924 0.2969
## [11] 0.7355
```

The most frequent use of `foreach` is to execute many compute-intensive iterations in parallel. The simplest form of parallelization is to use multiple cores on a single computer. One package that does this is **doMC**. After loading the library, we need to register the workers before using it as a parallel backend.

```
library(doMC)
```

```
## Loading required package: parallel
```

```
registerDoMC()
getDoParWorkers() # Checking the number of cores.
```

```
## [1] 4
```

Parallel `foreach`

After registering a parallel backend, simply replace `%do%` with `%dopar%`.

```
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:%
  when(i >= 0) %dopar% sqrt(i)
```

Time the differences between the single processor and multicore implementations.

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %do% sqrt(i))
```

```
## user system elapsed
## 0.165 0.001 0.167
```

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %dopar% sqrt(i))
```

```
## user system elapsed
## 0.162 0.030 0.177
```

The parallel version actually took a little longer: the overhead of dispatching tasks to the workers outweighed the trivial computation. If the core function takes a long time to run, this changes.

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %do% {
    Sys.sleep(0.01)
    sqrt(i)
  })
```

```
## user system elapsed
## 0.474 0.029 6.062
```

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %dopar% {
    Sys.sleep(0.01)
    sqrt(i)
  })
```

```
## user system elapsed
## 0.218 0.072 1.617
```

The parallel version is significantly faster.