Usual knocks on **R**

- It's slow
- Objects have to be loaded into memory to analyze them.

Both of these problems limit its usefulness for handling big data and for situations where millions of calculations need to be done.

Libraries like `foreach`, `iterators`, and `ff`, together with packages that facilitate parallelization, are relieving these deficiencies.

The `for` loop

```
for (i in 1:5) {
  print(i^2)
}
```

```
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
```

```
i_squared <- integer(5)
for (i in 1:5) {
  i_squared[i] <- i^2
}
i_squared
```

```
## [1] 1 4 9 16 25
```

```
i_squared <- integer(1e+09)
system.time(for (i in 1:1e+09) {
  i_squared[i] <- i^2
})
```

```
## user system elapsed
## 2.834 2.737 6.005
```

Drawbacks of `for`:

- You have to initialize container variables;
- It can only iterate over simple structures (vectors and lists);
- It is sometimes slow.

`foreach`

The utility of `foreach` and its syntax are different from `for`.

```
library(foreach)
```

```
x <- foreach(i = 1:3) %do% sqrt(i)
x
```

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414
##
## [[3]]
## [1] 1.732
```

Differences:

1. `=` instead of `in` for the index variable;
2. `%do%` to run the expression that follows;
3. `foreach` returns the computed output of each iteration, by default in a list.

This works a little like `lapply`.

You can supply multiple indices, with the following result.

```
x <- foreach(a = 1:10, b = c(25, 40)) %do% {
  a + b
}
x
```

```
## [[1]]
## [1] 26
##
## [[2]]
## [1] 42
```

The indices are paired up, and iteration stops when the shortest one is exhausted.

Rewrite the following timed `for` loop with `foreach` and inspect the time needed.

```
i_squared <- integer(1e3)
system.time(
  for (i in 1:1e3) { i_squared[i] <- i^2 }
)
```
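One possible `foreach` rewrite of the loop above is sketched below; no container needs to be initialized, and exact timings will differ from machine to machine.

```r
library(foreach)

# Each iteration returns i^2; .combine = "c" concatenates the
# results into a numeric vector instead of a list.
system.time(
  i_squared <- foreach(i = 1:1e3, .combine = "c") %do% i^2
)
i_squared[1:5]  # 1 4 9 16 25
```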

This was faster with `foreach`, but it doesn't scale very well as the iterator length increases.

By default, output is returned as a list to accommodate any possible return value. However, the `.combine` option can tell `foreach` to return the outputs in another format.

```
x <- foreach(i = 1:5, .combine = "c") %do% log(i)
x
```

```
## [1] 0.0000 0.6931 1.0986 1.3863 1.6094
```

This returned a numeric vector, as `sapply` would. Each output is a single number, so combining with `c()` produces a numeric vector.

This is a flexible utility that can deal with vectors, too.

```
x <- foreach(i = 1:5, .combine = "cbind") %do% rnorm(3)
x
```

```
## result.1 result.2 result.3 result.4 result.5
## [1,] 0.8963 1.5228 0.9151 -0.2878 0.8880
## [2,] -0.5545 0.6060 1.1870 0.3002 -0.9591
## [3,] -0.3946 -0.2651 -0.9609 0.8679 -0.4606
```

```
y <- foreach(i = 1:5, .combine = "rbind") %do% rnorm(3)
y
```

```
## [,1] [,2] [,3]
## result.1 -1.16722 -0.9703 0.3899
## result.2 0.06181 -0.1538 0.5110
## result.3 -0.62578 1.4343 -0.9803
## result.4 0.66444 -0.8928 -0.1510
## result.5 1.60457 -0.2227 -1.7464
```

In this way we can create matrix output.

You can also do numeric processing on the outputs by supplying your own combining function.

```
myfun <- function(a, b) {
  c <- cbind(a, b)
  apply(c, 1, mean)
}
foreach(i = 1:5, .combine = "myfun") %do% rnorm(3)
```

```
## [1] 0.76257 -0.35699 0.01977
```

`iterators`

At the heart of any looping structure is a variable that iterates sequentially through a collection of values. At any point in the script it *remembers* what the current value is and can produce the next value to be processed. Besides being handy in uses of `foreach`, this proves very useful when processing huge amounts of data in manageable chunks.

Almost all objects in **R** that contain data are indexed, hence can be made into iterators.

```
library(iterators)
i1 <- iter(1:20)
nextElem(i1)
```

```
## [1] 1
```

```
nextElem(i1)
```

```
## [1] 2
```
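When an iterator runs out of values, `nextElem` signals an error whose message is `"StopIteration"`; a small sketch of catching it with a short iterator:

```r
library(iterators)

i2 <- iter(1:2)
nextElem(i2)  # 1
nextElem(i2)  # 2

# The third call raises the "StopIteration" error; capture its message.
res <- tryCatch(nextElem(i2), error = function(e) conditionMessage(e))
res  # "StopIteration"
```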

You can turn a matrix or data.frame into an iterator that iterates over rows or columns.

Consider the `state.x77` matrix from base **R**'s **datasets** package.

```
state.x77[1:3, 1:5]
```

```
## Population Income Illiteracy Life Exp Murder
## Alabama 3615 3624 2.1 69.05 15.1
## Alaska 365 6315 1.5 69.31 11.3
## Arizona 2212 4530 1.8 70.55 7.8
```

Each row contains statistics for a state.

```
istate <- iter(state.x77, by = "row")
nextElem(istate)
```

```
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
```

```
nextElem(istate)
```

```
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
```

`iter` can also be applied to a function. Each element of the iterator is then a new call to the function, often a random sampler.

```
ifun <- iter(function() sample(0:8, size = 9, replace = T))
nextElem(ifun)
```

```
## [1] 6 1 5 3 4 5 8 5 7
```

```
nextElem(ifun)
```

```
## [1] 7 6 4 1 3 2 4 0 5
```

The iterators package has some functions that make it easy to define common iterators.

```
a <- icount(20) # an iterator running from 1 to 20
a <- irnorm(2, count = 5) # 5 samplings of pairs from a normal distribution
```

A very useful one we'll see later iterates over lines in a file.
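That iterator is `ireadLines()`, which steps through the lines of a file or connection one at a time. A small sketch using a temporary file (the file and its contents are made up for illustration):

```r
library(iterators)

# Write a throwaway two-line file to iterate over.
tmp <- tempfile()
writeLines(c("first line", "second line"), tmp)

ifile <- ireadLines(tmp)   # by default yields one line per call
first <- nextElem(ifile)   # "first line"
second <- nextElem(ifile)  # "second line"

unlink(tmp)
```

Because only one line is held in memory at a time, this is how very large files can be processed in manageable chunks.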

You can combine this with `foreach` to write your own apply-style function.

```
mat <- matrix(rnorm(1e+06), ncol = 10000)
dim(mat)
```

```
## [1] 100 10000
```

```
itx <- iter(mat, by = "row")
row_means <- foreach(i = itx, .combine = "c") %do% mean(i)
row_means[1:10]
```

```
## [1] -0.0216767 0.0006555 0.0013509 0.0051333 0.0081804 -0.0022249
## [7] 0.0189809 -0.0100963 -0.0137118 -0.0059361
```

We can parallelize this to make it faster when the function applied to each row is compute-intensive.

In some situations we may want to apply a function to an iteration value only when a certain condition is met. `foreach` supports this through the `%:%` operator and `when()`. This can be very handy when lots of data is stored in one object but we're only interested in a small fraction of it.

For example,

```
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:%
  when(i >= 0) %do% sqrt(i)
length(x)
```

```
## [1] 11
```

```
x
```

```
## [1] 1.0770 1.1556 1.0300 1.3946 0.6176 0.3800 0.5537 0.7717 0.7924 0.2969
## [11] 0.7355
```

The most frequent use of `foreach` is to execute many compute-intensive iterations in parallel. The simplest form of parallelization is to use multiple cores on a single computer. One package that does this is **doMC**. After loading the library, we need to register the workers before using it as a parallel backend.

```
library(doMC)
```

```
## Loading required package: parallel
```

```
registerDoMC()
getDoParWorkers() # Checking the number of cores.
```

```
## [1] 4
```

Parallel `foreach`

After registering a parallel backend, simply replace `%do%` with `%dopar%`.

```
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:%
  when(i >= 0) %dopar% sqrt(i)
```

Time the differences between the single processor and multicore implementations.

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %do% sqrt(i))
```

```
## user system elapsed
## 0.165 0.001 0.167
```

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %dopar% sqrt(i))
```

```
## user system elapsed
## 0.162 0.030 0.177
```

The parallel version actually took a little longer: the overhead of dispatching tasks to the workers outweighed the trivial computation. If the core function takes a long time to run, this changes.

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %do% {
    Sys.sleep(0.01)
    sqrt(i)
  })
```

```
## user system elapsed
## 0.474 0.029 6.062
```

```
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:%
  when(i >= 0) %dopar% {
    Sys.sleep(0.01)
    sqrt(i)
  })
```

```
## user system elapsed
## 0.218 0.072 1.617
```

The parallel version is significantly faster.