In practical problems, when there are many cases or covariates to consider, it's common to store all of the work in a list. Further analyses of these results boil down to applying a function to each component of the list. R has a nice way of doing this that is similar to the apply function for matrices.
lapply
The basic lapply function takes two arguments. The first argument is a vector, either an atomic vector or a list. The second argument is a function. lapply returns a list as its output. The output list has one component for each component of the input list, and its value is the result of applying the function to that input component.
output_list <- lapply(input_list, function)
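As a minimal sketch of the pattern (toy data, unrelated to the sensor readings below), here sqrt is applied to each component of a small named list:

```r
# A small toy list; component names carry over to the output list
toy <- list(a = c(1, 4, 9), b = c(16, 25))
toy_sqrt <- lapply(toy, sqrt)
toy_sqrt$a
## [1] 1 2 3
length(toy_sqrt)
## [1] 2
```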
Following is simulated data from a network of 100,000 environmental sensors. Each sensor takes a reading of the CO2 concentration each day. However, the sensors are configured so that if the concentration goes above a certain threshold, they record a reading every 6 hours until it goes back down. Find the mean CO2 concentration for each sensor. Second, compute the variance of these mean readings.
The challenge here is that there may be a varying number of readings for each sensor, so arranging them in a matrix won't work. Instead, we are given a list with 100,000 components, each one containing the readings from one sensor.
Let's read in the list of readings over a 28 day period.
setwd("~/Documents/Computing with Data/8_Functions_lists/")
The following generates the simulated data. This was only run once and the resulting objects saved to disk and then reloaded.
base_readings <- lapply(1:1e+05, function(z) {
sample(x = 292:335, size = 28, replace = T)
})
summary(base_readings[[1]])
number_extras <- sapply(1:1e+05, function(z) {
sample(x = 0:15, size = 1, replace = T)
})
number_extras[1:5]
more_readings <- lapply(1:1e+05, function(z) {
sample(x = 335:355, size = number_extras[z], replace = T)
})
co2_readings <- lapply(1:1e+05, function(z) {
c(base_readings[[z]], more_readings[[z]])
})
names(co2_readings) <- paste("S", 1:1e+05, sep = "")
save(co2_readings, file = "./Data/co2_readings.RData")
co2_readings2 <- co2_readings
co2_readings2[[25]] <- c(co2_readings2[[25]], NA)
save(co2_readings2, file = "./Data/co2_readings2.RData")
co2_readings3 <- co2_readings2
co2_readings3[[101]] <- co2_readings3[[999]] <- numeric(0)
save(co2_readings3, file = "./Data/co2_readings3.RData")
load("./Data/co2_readings.RData")
Finding the mean value of readings for one sensor at a time is easy:
mean(co2_readings[[1]])
## [1] 322.8
mean(co2_readings[[1000]])
## [1] 326.9
To compute the means for all of them we use lapply
as follows:
all_means_L <- lapply(co2_readings, mean)
all_means_L[1:2]
## $S1
## [1] 322.8
##
## $S2
## [1] 321.7
class(all_means_L)
## [1] "list"
This is a list in which each component is a single number. It is more natural and useful to represent this as a numeric vector of length 100,000. The function unlist maps the list to an atomic vector by concatenating all of the component vectors.
all_means <- unlist(all_means_L)
all_means[1:2]
## S1 S2
## 322.8 321.7
class(all_means)
## [1] "numeric"
Now that we have a numeric vector of the means, we can answer the second question; i.e., what is the variance of these means?
var(all_means)
## [1] 15.23
Seldom is it enough to use a basic pre-defined function with only default options. Here are examples of other situations.
Select the first reading for each sensor.
Work through the first sensor as an example.
sens1 <- co2_readings[[1]]
sens1[1]
## [1] 312
# Define a function that selects the first entry for a vector of length at
# least 1 and NA otherwise
sel_first <- function(x) {
if (length(x) > 0) {
y <- x[1]
} else {
y <- NA
}
y
}
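A quick sanity check of sel_first on toy inputs confirms that both branches behave as intended:

```r
sel_first(c(312, 300))  # non-empty vector: returns the first entry
## [1] 312
sel_first(numeric(0))   # empty vector: returns NA
## [1] NA
```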
first_reads_L <- lapply(co2_readings, sel_first)
first_reads <- unlist(first_reads_L)
summary(first_reads)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 292 302 314 314 325 335
An example in which non-default options need to be set.
load("./Data/co2_readings2.RData")
This set of readings is nearly identical, but some sensors have missing values. We simply ignore these in computing the mean.
all_means2_L <- lapply(co2_readings2, mean)
all_means2 <- unlist(all_means2_L)
sum(is.na(all_means2))
## [1] 1
In this form we lose all readings for that sensor. We can recover the non-missing ones as follows.
all_means3_L <- lapply(co2_readings2, function(x) {
mean(x, na.rm = T)
})
all_means3 <- unlist(all_means3_L)
sum(is.na(all_means3))
## [1] 0
sapply is a useful shortcut
We've seen several instances in which we want to collapse the returned list into a vector of outputs. Using the function sapply instead of lapply does this in one step.
means_vector <- sapply(co2_readings, mean)
means_vector[1:2]
## S1 S2
## 322.8 321.7
If, for one of the input components, the function doesn't return a single number or character, sapply will return a list.
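For instance, when the results have differing lengths there is nothing to simplify to (toy data):

```r
# The results have lengths 1 and 3, so sapply cannot collapse them
ragged <- list(a = 1:3, b = 1:5)
out <- sapply(ragged, function(x) x[x > 2])
class(out)
## [1] "list"
```

When the output type is known in advance, vapply(co2_readings, mean, FUN.VALUE = numeric(1)) behaves like sapply but throws an error instead of silently returning a list.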
lapply vs. an explicit loop
It's tempting to simply iterate over the list with a for loop and compute the means one component at a time. Let's see how the alternatives compare in runtime.
means_v2 <- numeric(1e+05)
names(means_v2) <- names(co2_readings)
system.time(for (j in 1:1e+05) {
means_v2[j] <- mean(co2_readings[[j]])
})
## user system elapsed
## 0.820 0.004 0.823
system.time(all_means_L <- lapply(co2_readings, mean))
## user system elapsed
## 0.570 0.003 0.574
system.time(means_vector <- sapply(co2_readings, mean))
## user system elapsed
## 0.705 0.002 0.707
The lapply
version was considerably faster.
The lapply
function facilitates deeper study of the arrival/delay data as it depends on airline, day of the week, type of delay, etc.
First load the data from the homework set.
arrival_df <- read.csv("../Homework/ontime_flight_data/ONTIME1.csv", stringsAsFactors = F)
sapply(arrival_df, class) # Quick way to survey variables and their classes. A data.frame is a list!
## YEAR MONTH DAY_OF_MONTH
## "integer" "integer" "integer"
## AIRLINE_ID CARRIER ORIGIN_AIRPORT_ID
## "integer" "character" "integer"
## ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID DEST_AIRPORT_ID
## "integer" "integer" "integer"
## DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID ARR_TIME
## "integer" "integer" "integer"
## ARR_DELAY CARRIER_DELAY WEATHER_DELAY
## "numeric" "numeric" "numeric"
## NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY
## "numeric" "numeric" "numeric"
## X
## "logical"
Problem: Compute the rate of delay for each airline.
First, add a variable indicating whether there was a delay of some kind.
arrival_df2 <- transform(arrival_df, DELAY_LOGICAL = !is.na(CARRIER_DELAY))
What are the possible carriers?
unique(arrival_df2$CARRIER)
## [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX"
## [15] "WN" "YV"
length(unique(arrival_df2$CARRIER))
## [1] 16
We saw in the homework how to subset to a single carrier.
united_flights_df <- subset(arrival_df2, CARRIER == "UA")
The fraction of United flights with a delay is
sum(united_flights_df$DELAY_LOGICAL)/nrow(united_flights_df)
## [1] 0.1641
Write a function that will go through these steps for an arbitrary carrier.
delay_rate <- function(arriv_df, carr) {
carr_df <- subset(arriv_df, CARRIER == carr)
rt <- sum(carr_df$DELAY_LOGICAL)/nrow(carr_df)
rt
}
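Before applying it to all carriers, a check on a toy data frame (hypothetical values, not the flight data) shows the function behaves as expected:

```r
# 1 of the 3 UA rows is delayed, so the UA rate should be 1/3
toy_df <- data.frame(CARRIER = c("UA", "UA", "AA", "UA"),
                     DELAY_LOGICAL = c(TRUE, FALSE, TRUE, FALSE))
delay_rate(toy_df, "UA")
```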
We want to apply this function to arrival_df2 for each carrier. We need the vector of carriers as input to lapply.
carrier_vector <- unique(arrival_df2$CARRIER)
Now apply the delay_rate function to each entry in the vector. The function outputs a single number, so we can use sapply.
del_rates <- sapply(carrier_vector, function(carr) delay_rate(arrival_df2, carr))
del_rates
## 9E AA AS B6 DL EV F9 FL HA
## 0.17994 0.18487 0.12388 0.21046 0.11825 0.25179 0.28360 0.09965 0.07355
## MQ OO UA US VX WN YV
## 0.21726 0.19408 0.16413 0.16075 0.06785 0.14385 0.16653
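As an aside, the same per-carrier rates can be obtained without a helper function: tapply splits a vector by a grouping factor and applies a function within each group, and the mean of a logical vector is the fraction of TRUEs (a sketch, assuming arrival_df2 as constructed above):

```r
# mean of a logical vector = proportion TRUE, i.e., the delay rate
del_rates2 <- tapply(arrival_df2$DELAY_LOGICAL, arrival_df2$CARRIER, mean)
```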