# Looping and iterators, part I

Usual knocks on R

• It's slow
• Objects must be loaded into memory before they can be analyzed.

Both of these problems make R of limited use for handling big data and for situations where millions of calculations need to be done.

Packages like `foreach`, `iterators`, and `ff`, together with packages that facilitate parallelization, go a long way toward relieving these deficiencies.

# Base R structures and limitations

### The `for` loop

``````
for (i in 1:5) {
  print(i^2)
}
``````
``````
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
``````

### Initialize the output container

``````
i_squared <- integer(5)
for (i in 1:5) {
  i_squared[i] <- i^2
}
i_squared
``````
``````
## [1]  1  4  9 16 25
``````
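One caveat when writing the loop header: `1:length(x)` misbehaves when the container is empty, so `seq_along()` (or `seq_len(n)`) is the safer idiom. A minimal illustration:

``````
x <- integer(0)
1:length(x)   # counts down: 1 0 -- the loop body would run twice
seq_along(x)  # integer(0) -- the loop body is skipped, as intended
``````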

### Speed test

``````
i_squared <- integer(1e+09)
system.time(for (i in 1:1e+09) {
  i_squared[i] <- i^2
})
``````
``````
##    user  system elapsed
##   2.834   2.737   6.005
``````

### Limitations of `for`

• Have to initialize container variables;
• can only iterate over simple structures (vectors and lists);
• slow (sometimes).
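The first and third points are related: growing a result vector inside the loop forces repeated reallocation and copying, which is where much of the slowness comes from. A quick comparison (timings are machine-dependent):

``````
n <- 1e5

# Growing the result one element at a time: repeated reallocation.
system.time({
  grown <- integer(0)
  for (i in 1:n) grown[i] <- i^2
})

# Preallocating once: the same loop, typically much faster.
system.time({
  pre <- integer(n)
  for (i in 1:n) pre[i] <- i^2
})
``````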

## Getting started with `foreach`

The purpose of `foreach` and its syntax are both different from those of `for`.

``````
library(foreach)
``````
``````
x <- foreach(i = 1:3) %do% sqrt(i)
x
``````
``````
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414
##
## [[3]]
## [1] 1.732
``````

Differences:
1. `=` instead of `in` for the index variable;
2. `%do%` to run the expression that follows;
3. `foreach` returns the computed output of each iteration, by default in a list.

This works a little like `lapply`.
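The resemblance to `lapply` is easy to check: the same computation written both ways gives equivalent lists.

``````
library(foreach)

via_foreach <- foreach(i = 1:3) %do% sqrt(i)
via_lapply  <- lapply(1:3, sqrt)

all.equal(via_foreach, via_lapply)
``````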

### Options for multiple iteration indices

You can supply multiple iteration indices:

``````
x <- foreach(a = 1:10, b = c(25, 40)) %do% {
  a + b
}
x
``````
``````
## [[1]]
## [1] 26
##
## [[2]]
## [1] 42
``````

The indices are paired up, and iteration stops when the shortest one is exhausted.

### Example

Rewrite the following timed `for` loop with `foreach` and inspect the time needed.

``````
i_squared <- integer(1e3)
system.time(
  for (i in 1:1e3) {
    i_squared[i] <- i^2
  }
)
``````
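One possible rewrite (timings will vary by machine):

``````
library(foreach)

system.time(
  i_squared <- foreach(i = 1:1e3, .combine = "c") %do% i^2
)
``````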

`foreach` carries noticeable per-iteration overhead, so it does not scale well relative to `for` as the iterator length increases.

## Options for combining the outputs

By default, output is returned as a list to accommodate any possible return value. However, the `.combine` option tells `foreach` to return the outputs in another format.

``````
x <- foreach(i = 1:5, .combine = "c") %do% log(i)
x
``````
``````
## [1] 0.0000 0.6931 1.0986 1.3863 1.6094
``````

This returned a numeric vector, much like `sapply`. Each output is a single number, so combining them with `c()` produces a numeric vector.
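`.combine` accepts any function of two arguments, so it can also reduce the outputs instead of collecting them. For instance, `"+"` sums the results as they arrive:

``````
library(foreach)

total <- foreach(i = 1:5, .combine = "+") %do% i^2
total
## [1] 55
``````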

This is a flexible utility that can deal with vectors, too.

``````
x <- foreach(i = 1:5, .combine = "cbind") %do% rnorm(3)
x
``````
``````
##      result.1 result.2 result.3 result.4 result.5
## [1,]   0.8963   1.5228   0.9151  -0.2878   0.8880
## [2,]  -0.5545   0.6060   1.1870   0.3002  -0.9591
## [3,]  -0.3946  -0.2651  -0.9609   0.8679  -0.4606
``````
``````
y <- foreach(i = 1:5, .combine = "rbind") %do% rnorm(3)
y
``````
``````
##              [,1]    [,2]    [,3]
## result.1 -1.16722 -0.9703  0.3899
## result.2  0.06181 -0.1538  0.5110
## result.3 -0.62578  1.4343 -0.9803
## result.4  0.66444 -0.8928 -0.1510
## result.5  1.60457 -0.2227 -1.7464
``````

In this way we can create matrix output.

You can also do numeric processing on the outputs with a custom combining function.

``````
myfun <- function(a, b) {
  m <- cbind(a, b)   # bind the two results side by side
  apply(m, 1, mean)  # row-wise mean of the pair
}
# A two-argument .combine function is applied pairwise as results arrive.
foreach(i = 1:5, .combine = "myfun") %do% rnorm(3)
``````
``````
## [1]  0.76257 -0.35699  0.01977
``````

# Iterators

At the heart of any looping structure is a variable that iterates sequentially through a collection of values. At any point in the script it remembers what the current value is and can produce the next value to be processed. Besides being handy with `foreach`, this proves very useful for processing huge amounts of data in manageable chunks.

Almost all objects in R that contain data are indexed, hence can be made into iterators.

``````
library(iterators)
i1 <- iter(1:20)
nextElem(i1)
``````
``````
## [1] 1
``````
``````
nextElem(i1)
``````
``````
## [1] 2
``````
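When an iterator is exhausted, `nextElem` signals a "StopIteration" error, which you can trap with `tryCatch`. A short sketch:

``````
library(iterators)

i2 <- iter(1:2)
nextElem(i2)  # 1
nextElem(i2)  # 2

# A third call raises an error; trap it and inspect the message.
tryCatch(nextElem(i2), error = function(e) conditionMessage(e))
## [1] "StopIteration"
``````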

You can turn a matrix or data.frame into an iterator in which iteration is over rows or columns.

Consider the `state.x77` matrix provided in base R's `datasets` package.

``````
state.x77[1:3, 1:5]
``````
``````
##         Population Income Illiteracy Life Exp Murder
## Alabama       3615   3624        2.1    69.05   15.1
## Alaska         365   6315        1.5    69.31   11.3
## Arizona       2212   4530        1.8    70.55    7.8
``````

Each row contains statistics for a state.

``````
istate <- iter(state.x77, by = "row")
nextElem(istate)
``````
``````
##         Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708
``````
``````
nextElem(istate)
``````
``````
##        Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
``````

`iter` can also be applied to a function. Each element of the iterator is a new call to the function, often a random sampler.

``````
ifun <- iter(function() sample(0:8, size = 9, replace = TRUE))
nextElem(ifun)
``````
``````
## [1] 6 1 5 3 4 5 8 5 7
``````
``````
nextElem(ifun)
``````
``````
## [1] 7 6 4 1 3 2 4 0 5
``````

The iterators package has some functions that make it easy to define common iterators.

``````
a <- icount(20)  # an iterator running from 1 to 20
b <- irnorm(2, count = 5)  # 5 draws of pairs from a standard normal distribution
``````
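These behave like any other iterator, so `nextElem` steps through them in the usual way:

``````
library(iterators)

a <- icount(20)
nextElem(a)  # 1
nextElem(a)  # 2

b <- irnorm(2, count = 5)
nextElem(b)  # a length-2 vector of normal draws
``````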

A very useful one we'll see later iterates over lines in a file.

You can combine this with `foreach` to write your own apply-style function.

``````
mat <- matrix(rnorm(1e+06), ncol = 10000)
dim(mat)
``````
``````
## [1]   100 10000
``````
``````
itx <- iter(mat, by = "row")
row_means <- foreach(i = itx, .combine = "c") %do% mean(i)
row_means[1:10]
``````
``````
##  [1] -0.0216767  0.0006555  0.0013509  0.0051333  0.0081804 -0.0022249
##  [7]  0.0189809 -0.0100963 -0.0137118 -0.0059361
``````
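As a sanity check, the result should agree with base R's vectorized `rowMeans` (shown here on a smaller matrix), which is also far faster when the per-row function is as cheap as `mean`:

``````
library(foreach)
library(iterators)

mat <- matrix(rnorm(1e4), ncol = 100)
itx <- iter(mat, by = "row")
row_means <- foreach(i = itx, .combine = "c") %do% mean(i)

# Should agree with base R's vectorized rowMeans().
all.equal(row_means, rowMeans(mat))
``````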

We can parallelize this to make it faster when the function applied to each row is compute-intensive.

## Conditional execution

In some situations we may want to apply a function to an iteration value only when a certain condition is met, and `foreach` provides a way to do that. This can be very handy when lots of data is stored in one object but we're only interested in a small fraction of it.

For example,

``````
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:% when(i >= 0) %do%
  sqrt(i)
length(x)
``````
``````
## [1] 11
``````
``````
x
``````
``````
##  [1] 1.0770 1.1556 1.0300 1.3946 0.6176 0.3800 0.5537 0.7717 0.7924 0.2969
## [11] 0.7355
``````

## Parallelization with multiple cores

One of the most frequent uses of `foreach` is to execute many compute-intensive iterations in parallel. The simplest form of parallelization is to use multiple cores on a single computer. One package that enables this is `doMC`. After loading the library, we need to register the workers before using them as parallel processors.

``````
library(doMC)
``````
``````
## Loading required package: parallel
``````
``````
registerDoMC()
getDoParWorkers()  # Checking the number of cores.
``````
``````
## [1] 4
``````
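You can also request a specific number of workers explicitly via the `cores` argument of `registerDoMC`:

``````
library(doMC)

registerDoMC(cores = 2)  # register exactly 2 workers
getDoParWorkers()
## [1] 2
``````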

### Parallel `foreach`

After registering a parallel backend, simply replace `%do%` with `%dopar%`.

``````
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:% when(i >= 0) %dopar%
  sqrt(i)
``````

Time the differences between the single processor and multicore implementations.

``````
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %do% sqrt(i))
``````
``````
##    user  system elapsed
##   0.165   0.001   0.167
``````
``````
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %dopar% sqrt(i))
``````
``````
##    user  system elapsed
##   0.162   0.030   0.177
``````

The parallel version actually took a little longer, because of the overhead of distributing the work to the workers. If the core function takes a long time to run, this changes.

``````
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %do% {
    Sys.sleep(0.01)
    sqrt(i)
  })
``````
``````
##    user  system elapsed
##   0.474   0.029   6.062
``````
``````
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %dopar% {
    Sys.sleep(0.01)
    sqrt(i)
  })
``````
``````
##    user  system elapsed
##   0.218   0.072   1.617
``````

The parallel version is significantly faster.