In practical problems, when there are many cases or covariates to consider, it's common to store all of the work in a list. Further analyses of these results boil down to applying a function to each component of the list. R has a nice way of doing this that is similar to the apply function for matrices.
lapply
The basic lapply function takes two arguments. The first argument is a vector, either an atomic vector or a list. The second argument is a function. lapply returns a list as its output. The output list has one component for each component of the input list, and its value is the result of applying the function to that input component.
output_list <- lapply(input_list, function)
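As a minimal sketch of the pattern (toy data, unrelated to the sensor readings below), here sqrt is applied to each component of a small named list:

```r
# A small toy list; component names carry over to the output list
toy <- list(a = c(1, 4, 9), b = c(16, 25))
toy_sqrt <- lapply(toy, sqrt)
toy_sqrt$a
## [1] 1 2 3
length(toy_sqrt)
## [1] 2
```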
Following is simulated data from a network of 100,000 environmental sensors. Each sensor takes a reading of the CO2 concentration each day. However, the sensors are configured so that if the concentration goes above a certain threshold, they record a reading every 6 hours until it goes back down. Find the mean CO2 concentration for each sensor. Second, compute the variance of these mean readings.
The challenge here is that there may be a varying number of readings for each sensor, so arranging them in a matrix won't work. Instead, we are given a list with 100,000 components, each one containing the readings from one sensor.
Let's read in the list of readings over a 28 day period.
setwd("~/Documents/Computing with Data/8_Functions_lists/")
The following generates the simulated data. This was only run once and the resulting objects saved to disk and then reloaded.
base_readings <- lapply(1:1e+05, function(z) {
sample(x = 292:335, size = 28, replace = T)
})
summary(base_readings[[1]])
number_extras <- sapply(1:1e+05, function(z) {
sample(x = 0:15, size = 1, replace = T)
})
number_extras[1:5]
more_readings <- lapply(1:1e+05, function(z) {
sample(x = 335:355, size = number_extras[z], replace = T)
})
co2_readings <- lapply(1:1e+05, function(z) {
c(base_readings[[z]], more_readings[[z]])
})
names(co2_readings) <- paste("S", 1:1e+05, sep = "")
save(co2_readings, file = "./Data/co2_readings.RData")
co2_readings2 <- co2_readings
co2_readings2[[25]] <- c(co2_readings2[[25]], NA)
save(co2_readings2, file = "./Data/co2_readings2.RData")
co2_readings3 <- co2_readings2
co2_readings3[[101]] <- co2_readings3[[999]] <- numeric(0)
save(co2_readings3, file = "./Data/co2_readings3.RData")
load("./Data/co2_readings.RData")
Finding the mean value of readings for one sensor at a time is easy:
mean(co2_readings[[1]])
## [1] 322.8
mean(co2_readings[[1000]])
## [1] 326.9
To compute the means for all of them we use lapply
as follows:
all_means_L <- lapply(co2_readings, mean)
all_means_L[1:2]
## $S1
## [1] 322.8
##
## $S2
## [1] 321.7
class(all_means_L)
## [1] "list"
This is a list in which each component is a single number. It is more natural and useful to represent this as a numeric vector of length 100,000. The function unlist maps the list to an atomic vector by concatenating all of the component vectors.
all_means <- unlist(all_means_L)
all_means[1:2]
## S1 S2
## 322.8 321.7
class(all_means)
## [1] "numeric"
Now that we have a numeric vector of the means, we can answer the second question; i.e., what is the variance of these means?
var(all_means)
## [1] 15.23
Seldom is it enough to use a basic pre-defined function with only default options. Here are examples of other situations.
Select the first reading for each sensor.
Work through the first sensor as an example.
sens1 <- co2_readings[[1]]
sens1[1]
## [1] 312
# Define a function that selects the first entry for a vector of length at
# least 1 and NA otherwise
sel_first <- function(x) {
if (length(x) > 0) {
y <- x[1]
} else {
y <- NA
}
y
}
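A quick sanity check of sel_first on toy inputs confirms that both branches behave as intended:

```r
sel_first(c(312, 300))  # non-empty vector: returns the first entry
## [1] 312
sel_first(numeric(0))   # empty vector: returns NA
## [1] NA
```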
first_reads_L <- lapply(co2_readings, sel_first)
first_reads <- unlist(first_reads_L)
summary(first_reads)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 292 302 314 314 325 335
An example in which non-default options need to be set.
load("./Data/co2_readings2.RData")
This set of readings is nearly identical, but some sensors have missing values. We simply ignore these in computing the mean.
all_means2_L <- lapply(co2_readings2, mean)
all_means2 <- unlist(all_means2_L)
sum(is.na(all_means2))
## [1] 1
In this form we lose all readings for that sensor. We can recover the non-missing ones as follows.
all_means3_L <- lapply(co2_readings2, function(x) {
mean(x, na.rm = T)
})
all_means3 <- unlist(all_means3_L)
sum(is.na(all_means3))
## [1] 0
sapply is a useful shortcut
We've seen several instances in which we want to collapse the returned list into a vector of outputs. Using the function sapply instead of lapply does this in one step.
means_vector <- sapply(co2_readings, mean)
means_vector[1:2]
## S1 S2
## 322.8 321.7
If, for one of the input components, the function doesn't return a single number or character, sapply will return a list.
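For instance, when the results have differing lengths there is nothing to simplify to (toy data):

```r
# The results have lengths 1 and 3, so sapply cannot collapse them
ragged <- list(a = 1:3, b = 1:5)
out <- sapply(ragged, function(x) x[x > 2])
class(out)
## [1] "list"
```

When the output type is known in advance, vapply(co2_readings, mean, FUN.VALUE = numeric(1)) behaves like sapply but throws an error instead of silently returning a list.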
lapply vs. an explicit loop
It's tempting to simply iterate over the list with a for loop and compute the means one component at a time. Let's see how the alternatives compare in runtime.
means_v2 <- numeric(1e+05)
names(means_v2) <- names(co2_readings)
system.time(for (j in 1:1e+05) {
means_v2[j] <- mean(co2_readings[[j]])
})
## user system elapsed
## 0.820 0.004 0.823
system.time(all_means_L <- lapply(co2_readings, mean))
## user system elapsed
## 0.570 0.003 0.574
system.time(means_vector <- sapply(co2_readings, mean))
## user system elapsed
## 0.705 0.002 0.707
The lapply
version was considerably faster.
The lapply
function facilitates deeper study of the arrival/delay data as it depends on airline, day of the week, type of delay, etc.
First load the data from the homework set.
arrival_df <- read.csv("../Homework/ontime_flight_data/ONTIME1.csv", stringsAsFactors = F)
sapply(arrival_df, class) # Quick way to survey variables and their classes. A data.frame is a list!
## YEAR MONTH DAY_OF_MONTH
## "integer" "integer" "integer"
## AIRLINE_ID CARRIER ORIGIN_AIRPORT_ID
## "integer" "character" "integer"
## ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID DEST_AIRPORT_ID
## "integer" "integer" "integer"
## DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID ARR_TIME
## "integer" "integer" "integer"
## ARR_DELAY CARRIER_DELAY WEATHER_DELAY
## "numeric" "numeric" "numeric"
## NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY
## "numeric" "numeric" "numeric"
## X
## "logical"
Problem: Compute the rate of delay for each airline.
First, add a variable indicating whether there was a delay of some kind.
arrival_df2 <- transform(arrival_df, DELAY_LOGICAL = !is.na(CARRIER_DELAY))
What are the possible carriers?
unique(arrival_df2$CARRIER)
## [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX"
## [15] "WN" "YV"
length(unique(arrival_df2$CARRIER))
## [1] 16
We saw in the homework how to subset to a single carrier.
united_flights_df <- subset(arrival_df2, CARRIER == "UA")
The fraction of United flights with a delay is
sum(united_flights_df$DELAY_LOGICAL)/nrow(united_flights_df)
## [1] 0.1641
Write a function that will go through these steps for an arbitrary carrier.
delay_rate <- function(arriv_df, carr) {
carr_df <- subset(arriv_df, CARRIER == carr)
rt <- sum(carr_df$DELAY_LOGICAL)/nrow(carr_df)
rt
}
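Before applying it to all carriers, a check on a toy data frame (hypothetical values, not the flight data) shows the function behaves as expected:

```r
# 1 of the 3 UA rows is delayed, so the UA rate should be 1/3
toy_df <- data.frame(CARRIER = c("UA", "UA", "AA", "UA"),
                     DELAY_LOGICAL = c(TRUE, FALSE, TRUE, FALSE))
delay_rate(toy_df, "UA")
```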
We want to apply this function to arrival_df2 for each carrier. We need the vector of carriers as input to lapply.
carrier_vector <- unique(arrival_df2$CARRIER)
Now apply the delay_rate function to each entry in the vector. The function outputs a single number, so we can use sapply.
del_rates <- sapply(carrier_vector, function(carr) delay_rate(arrival_df2, carr))
del_rates
## 9E AA AS B6 DL EV F9 FL HA
## 0.17994 0.18487 0.12388 0.21046 0.11825 0.25179 0.28360 0.09965 0.07355
## MQ OO UA US VX WN YV
## 0.21726 0.19408 0.16413 0.16075 0.06785 0.14385 0.16653
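As an aside, the same per-carrier rates can be obtained without a helper function: tapply splits a vector by a grouping factor and applies a function within each group, and the mean of a logical vector is the fraction of TRUEs (a sketch, assuming arrival_df2 as constructed above):

```r
# mean of a logical vector = proportion TRUE, i.e., the delay rate
del_rates2 <- tapply(arrival_df2$DELAY_LOGICAL, arrival_df2$CARRIER, mean)
```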