Usual knocks on R
R holds all of its data in memory and, out of the box, runs computations on a single core. Both of these problems make it of limited use for handling big data and for situations where millions of calculations need to be done.
Libraries like foreach, iterators, and ff, together with packages that facilitate parallelization, are relieving these deficiencies.
The for loop
for (i in 1:5) {
    print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
i_squared <- integer(5)   # pre-allocate the result vector
for (i in 1:5) {
    i_squared[i] <- i^2
}
i_squared
## [1] 1 4 9 16 25
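Pre-allocating the result matters: growing a vector with c() copies it on every iteration. A minimal sketch to see the difference (grow and prealloc are illustrative names; timings will vary by machine):
grow <- function(n) {
    x <- numeric(0)
    for (i in 1:n) x <- c(x, i^2)   # copies x on every iteration
    x
}
prealloc <- function(n) {
    x <- numeric(n)
    for (i in 1:n) x[i] <- i^2      # writes in place
    x
}
system.time(grow(1e4))
system.time(prealloc(1e4))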
i_squared <- integer(1e+09)
system.time(for (i in 1:1e+09) {
    i_squared[i] <- i^2
})
## user system elapsed
## 2.834 2.737 6.005
foreach
The utility of foreach and its syntax are different from those of for.
library(foreach)
x <- foreach(i = 1:3) %do% sqrt(i)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414
##
## [[3]]
## [1] 1.732
Differences:
1. = instead of in for the index variable;
2. %do% to evaluate the expression that follows;
3. foreach returns the computed output of each iteration, by default, in a list.
This works a little like lapply().
You can supply multiple index variables, with the following result.
x <- foreach(a = 1:10, b = c(25, 40)) %do% {
    a + b
}
x
## [[1]]
## [1] 26
##
## [[2]]
## [1] 42
The index variables are paired up, and iteration stops when the shortest one is exhausted.
Rewrite the following timed for loop with foreach and inspect the time needed.
i_squared <- integer(1e3)
system.time(
    for (i in 1:1e3) { i_squared[i] <- i^2 }
)
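One possible answer (a sketch; exact timings vary by machine):
system.time(
    i_squared <- foreach(i = 1:1e3, .combine = "c") %do% i^2
)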
Compare the timings. Note that foreach adds per-iteration overhead, so it does not scale very well as the iterator length increases.
By default, output is returned as a list to accommodate any possible return value. However, the .combine option tells foreach to return the outputs in another format.
x <- foreach(i = 1:5, .combine = "c") %do% log(i)
x
## [1] 0.0000 0.6931 1.0986 1.3863 1.6094
This returned a numeric vector, like sapply(). Each output is a single number, so combining with c() produces a numeric vector.
The .combine option is flexible and can deal with vector outputs, too.
x <- foreach(i = 1:5, .combine = "cbind") %do% rnorm(3)
x
## result.1 result.2 result.3 result.4 result.5
## [1,] 0.8963 1.5228 0.9151 -0.2878 0.8880
## [2,] -0.5545 0.6060 1.1870 0.3002 -0.9591
## [3,] -0.3946 -0.2651 -0.9609 0.8679 -0.4606
y <- foreach(i = 1:5, .combine = "rbind") %do% rnorm(3)
y
## [,1] [,2] [,3]
## result.1 -1.16722 -0.9703 0.3899
## result.2 0.06181 -0.1538 0.5110
## result.3 -0.62578 1.4343 -0.9803
## result.4 0.66444 -0.8928 -0.1510
## result.5 1.60457 -0.2227 -1.7464
In this way we can create matrix output.
You can also supply your own function to process the outputs numerically.
myfun <- function(a, b) {
    m <- cbind(a, b)     # combine the two results column-wise
    apply(m, 1, mean)    # row-wise means
}
foreach(i = 1:5, .combine = "myfun") %do% rnorm(3)
## [1] 0.76257 -0.35699 0.01977
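The combine function is applied cumulatively as results arrive, so any associative binary function works. For example, "+" sums the outputs (1 + 4 + 9 + 16 + 25):
total <- foreach(i = 1:5, .combine = "+") %do% i^2
total
## [1] 55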
Iterators
At the heart of any looping structure is a variable that iterates sequentially through a collection of values. At any place in the script, it remembers what the current value is and can produce the next value to be processed. Besides being handy in uses of foreach, this proves very useful in processing huge amounts of data in manageable chunks.
Almost all objects in R that contain data are indexed and hence can be made into iterators.
library(iterators)
i1 <- iter(1:20)
nextElem(i1)
## [1] 1
nextElem(i1)
## [1] 2
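When an iterator is exhausted, nextElem() signals a "StopIteration" error, which you can catch with tryCatch(); a minimal sketch:
i2 <- iter(1:2)
nextElem(i2)
## [1] 1
nextElem(i2)
## [1] 2
tryCatch(nextElem(i2), error = function(e) conditionMessage(e))
## [1] "StopIteration"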
You can turn a matrix or data.frame into an iterator in which iteration is over rows or columns.
Consider the state.x77 matrix provided in the datasets package.
state.x77[1:3, 1:5]
## Population Income Illiteracy Life Exp Murder
## Alabama 3615 3624 2.1 69.05 15.1
## Alaska 365 6315 1.5 69.31 11.3
## Arizona 2212 4530 1.8 70.55 7.8
Each row contains statistics for a state.
istate <- iter(state.x77, by = "row")
nextElem(istate)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
nextElem(istate)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
iter() can also be applied to a function. Each element of the iterator is then a new call to the function, often a random sampler.
ifun <- iter(function() sample(0:8, size = 9, replace = TRUE))
nextElem(ifun)
## [1] 6 1 5 3 4 5 8 5 7
nextElem(ifun)
## [1] 7 6 4 1 3 2 4 0 5
The iterators package has some functions that make it easy to define common iterators.
a <- icount(20)            # an iterator running from 1 to 20
b <- irnorm(2, count = 5)  # 5 draws of pairs from a standard normal
A very useful one we'll see later iterates over lines in a file.
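As a preview, ireadLines() from the iterators package builds such an iterator; a minimal sketch, assuming a text file named data.txt exists:
it <- ireadLines("data.txt", n = 1)  # yields one line per call
nextElem(it)                         # first line of the file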
You can combine iterators with foreach to write your own apply-style function.
mat <- matrix(rnorm(1e+06), ncol = 10000)
dim(mat)
## [1] 100 10000
itx <- iter(mat, by = "row")
row_means <- foreach(i = itx, .combine = "c") %do% mean(i)
row_means[1:10]
## [1] -0.0216767 0.0006555 0.0013509 0.0051333 0.0081804 -0.0022249
## [7] 0.0189809 -0.0100963 -0.0137118 -0.0059361
We can parallelize this to make it faster when the function applied to each row is compute-intensive.
In some situations we may want to apply the function only to those iteration values that meet a certain condition. foreach supports this through the %:% operator combined with when(). This can be very handy when lots of data is stored in one object but we're only interested in a small fraction of it.
For example,
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:% when(i >= 0) %do%
sqrt(i)
length(x)
## [1] 11
x
## [1] 1.0770 1.1556 1.0300 1.3946 0.6176 0.3800 0.5537 0.7717 0.7924 0.2969
## [11] 0.7355
The most frequent use of foreach is to execute many compute-intensive iterations in parallel. The simplest form of parallelization is to use multiple cores on a single computer. One package that does this is doMC. After loading the library, we need to register the workers before using it as a parallel backend.
library(doMC)
## Loading required package: parallel
registerDoMC()
getDoParWorkers() # Checking the number of cores.
## [1] 4
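If you want a specific number of workers rather than the default, pass cores explicitly (an illustrative sketch; the timings below used the default registration):
registerDoMC(cores = 2)  # register exactly two workers
getDoParWorkers()
## [1] 2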
After registering a parallel backend, simply replace %do% with %dopar% in the foreach call:
x <- foreach(i = irnorm(1, count = 20), .combine = "c") %:% when(i >= 0) %dopar%
sqrt(i)
Time the differences between the single processor and multicore implementations.
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:% when(i >=
0) %do% sqrt(i))
## user system elapsed
## 0.165 0.001 0.167
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:% when(i >=
0) %dopar% sqrt(i))
## user system elapsed
## 0.162 0.030 0.177
The parallel version actually took a little longer, because of the overhead of coordinating the workers. If each iteration takes a long time to run, this changes.
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:% when(i >=
0) %do% {
Sys.sleep(0.01)
sqrt(i)
})
## user system elapsed
## 0.474 0.029 6.062
system.time(x <- foreach(i = irnorm(1, count = 1000), .combine = "c") %:% when(i >=
0) %dopar% {
Sys.sleep(0.01)
sqrt(i)
})
## user system elapsed
## 0.218 0.072 1.617
The parallel version is significantly faster.