Introduction to split-apply methods with plyr

As the complexity of problems increases, so does the length and width of the data and the questions we want to ask. Rarely, can this be accommodated smoothly with a singl;e data.frame and thepre-packaged functions. Yet, in order to function effectively as data analysts and scientists we need to focus our attention on the problem and not get bogged down in coding. Having the fastest and most conceptually natural methods for getting information from data is important.

Earlier we introduced the functions apply, lapply and sapply for simultaneously applying a function to the rows or columns of an array or the components of a list. As one multi-step application, we coded a long complex variable selection problem into a list of data.frames, then applied a significance test function to each list component. The resulting p-values were placed in a single vector and analyzed together. In short, the steps are:

The plyr library of functions enables us to do many problems like this involving arrays, data.frames and lists in a uniform way. There are also ways to parallelize the process and take almost no effort (but a little thought).

Variable-defined splitting

One feature of the plyr family that we have yet to see is the ability to divide a data.frame into sub-frames based on the value of a variable and then apply a function to the subframes. In fact, this is one of the most common uses of plyr. We illustrate with an example.

setwd("~/Documents/Computing with Data/15_split_apply")
flight_ontime_status_df <- read.csv("./Data/flight_ontime_status.csv")
str(flight_ontime_status_df)
## 'data.frame':    486133 obs. of  21 variables:
##  $ ORIGIN_AIRPORT_ID    : int  12478 12478 12478 12478 12478 12478 12478 12478 12478 12478 ...
##  $ ORIGIN_AIRPORT_SEQ_ID: int  1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 ...
##  $ ORIGIN_CITY_MARKET_ID: int  31703 31703 31703 31703 31703 31703 31703 31703 31703 31703 ...
##  $ ORIGIN_STATE_ABR     : Factor w/ 52 levels "AK","AL","AR",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ DEST_AIRPORT_ID      : int  12892 12892 12892 12892 12892 12892 12892 12892 12892 12892 ...
##  $ DEST_AIRPORT_SEQ_ID  : int  1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 ...
##  $ DEST_CITY_MARKET_ID  : int  32575 32575 32575 32575 32575 32575 32575 32575 32575 32575 ...
##  $ DEST_STATE_ABR       : Factor w/ 52 levels "AK","AL","AR",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ DEP_TIME             : int  855 921 931 904 858 911 902 855 858 852 ...
##  $ DEP_DELAY            : num  -5 21 31 4 -2 11 2 -5 -2 -8 ...
##  $ DEP_DELAY_NEW        : num  0 21 31 4 0 11 2 0 0 0 ...
##  $ ARR_TIME             : int  1142 1210 1224 1151 1142 1151 1203 1129 1127 1134 ...
##  $ ARR_DELAY            : num  -43 -15 -1 -34 -43 -34 -22 -56 -58 -51 ...
##  $ ARR_DELAY_NEW        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DISTANCE             : num  2475 2475 2475 2475 2475 ...
##  $ CARRIER_DELAY        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WEATHER_DELAY        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NAS_DELAY            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SECURITY_DELAY       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ LATE_AIRCRAFT_DELAY  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X                    : logi  NA NA NA NA NA NA ...

This data is at the level of individual flights. Exploratory analysis of the data will call for analysis of the data broken down by airport or state or time of day, etc. Getting at this information requires some reorganization of the data.

Problem: Compute summary statistics for arrival delays for planes for all airports with a reasonable number of flights

This is a little fuzzy, but that's common. Let's first find the number of flights per airport. Each row represents a flight, so the number of times a given airport code appears is the number of flights for that airport. We can get this information with the table function.

# Eliminate missing values in arrival delays
flight_ontime_status_df2 <- flight_ontime_status_df[!is.na(flight_ontime_status_df$ARR_DELAY_NEW), 
    ]
numb_flights_per_airport <- table(flight_ontime_status_df2$ORIGIN_AIRPORT_ID)
numb_flights_per_airport[1:5]
## 
## 10135 10136 10140 10146 10155 
##   245   198  2495    85     2
class(numb_flights_per_airport)
## [1] "table"
summary(as.numeric(numb_flights_per_airport))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     102     302    1670    1180   31100

Let's exclude the first quartile with only around 100 flights.

active_airports <- names(numb_flights_per_airport)[numb_flights_per_airport > 
    100]
# subset the flight data to this set of airports
flight_status_active_df <- subset(flight_ontime_status_df2, ORIGIN_AIRPORT_ID %in% 
    active_airports)

Example airport

Getting, say, mean delay times for an individual airport is easy.

flight_status_active_df$ORIGIN_AIRPORT_ID[1:3]
## [1] 12478 12478 12478
# Take this as the example airport
mean(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW, 
    na.rm = TRUE)
## [1] 8.475
# More informative is the summary statistics
summary(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     8.5     2.0   804.0

Looking at just one airport it's hard to tell if this is a lot.

Statistics over airports with daply and ddply

Load the library.

library(plyr)
mean_delays_by_airport <- daply(flight_status_active_df, .(ORIGIN_AIRPORT_ID), 
    function(df) mean(df$ARR_DELAY_NEW, na.rm = T))
mean_delays_by_airport[1:5]
##  10135  10136  10140  10157  10185 
## 16.151  6.485  6.947 24.058 11.180
class(mean_delays_by_airport)
## [1] "numeric"
length(mean_delays_by_airport)
## [1] 216
summary(mean_delays_by_airport)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.30    7.39    8.95    9.87   11.70   24.10

What if we want all summary statistics for each one. We should get a matrix or data.frame of values.

summary_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID), 
    function(df) summary(df$ARR_DELAY_NEW))
head(summary_delays_by_airport)
##   ORIGIN_AIRPORT_ID Min. 1st Qu. Median  Mean 3rd Qu. Max.
## 1             10135    0       0      0 16.20     5.0  527
## 2             10136    0       0      0  6.48     0.0  219
## 3             10140    0       0      0  6.95     2.0  374
## 4             10157    0       0      0 24.10    19.0  288
## 5             10185    0       0      0 11.20     4.5  202
## 6             10208    0       0      0 20.40    13.0  317
# Get the 90th and 95th percentile
top_few <- function(x) {
    quantile(x, probs = c(0.9, 0.95))
}
top_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID), 
    function(df) top_few(df$ARR_DELAY_NEW))
head(top_delays_by_airport)
##   ORIGIN_AIRPORT_ID  90%   95%
## 1             10135 29.0  78.8
## 2             10136 13.9  28.3
## 3             10140 20.0  38.0
## 4             10157 85.0 143.0
## 5             10185 35.0  71.1
## 6             10208 56.6 121.5

Survey of the plyr family of functions

plyr provides 6 functions of the form

xzply

where x and z can be a, d or l. These letters are short for array, data.frame and list, respectively. The first slot, held by x, refers to the type of input, z refers to the format of the output. The format of how they work is as follows:

  1. list input The family lzply takes the form lzply(list, function). The function is applied to each component of the list.
  2. data.frame input The family dzply takes the form dzply(data, .(v1, v2, ...), function), where v1, v2, are variables. Here, the data.frame is subsetting by each combination of values of the variables to form a list of sub-data.frames. Then, the function is applied to each list component.
  3. array input The family azply takes the form azply(array, margins, function) where the array is split into sub-arrays determined by the specified margins, and then the function is applied to the sub-arrays

The value of z indicates what you do with the output of the functions; i.e., return a list of outputs, form into a new data.frame by stacking the output, or form into an array. Some options may be impossible, depending on the output of the function.

Working with lists

The function llply is largely similar to lapply. The function sapply can be replaced by laply, but the latter is more general.

la_ex1 <- list(a = 1:3, b = 8:20, c = rep(5, times = 10))
la_ex1
## $a
## [1] 1 2 3
## 
## $b
##  [1]  8  9 10 11 12 13 14 15 16 17 18 19 20
## 
## $c
##  [1] 5 5 5 5 5 5 5 5 5 5
laply(la_ex1, function(x) c(min(x), max(x)))
##      1  2
## [1,] 1  3
## [2,] 8 20
## [3,] 5  5

The output can be formed into a multidimensional array

The ldply function is appropriate if we want to caputure more information or the output yields a mixture of character and numeric data.

ldply(la_ex1, function(x) c(min(x), max(x)))
##   .id V1 V2
## 1   a  1  3
## 2   b  8 20
## 3   c  5  5

The data.frame output format includes a column of names of the list components, which can be useful.

Working with data frame input

df1 <- data.frame(V1 = paste("X", 1:10, sep = ""), V2 = c(letters[1:5], letters[1:5]), 
    V3 = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"), V4 = rnorm(10), 
    V5 = rnorm(10))
df1
##     V1 V2 V3       V4      V5
## 1   X1  a  A  0.34596 -0.9284
## 2   X2  b  A -2.08124  0.6062
## 3   X3  c  A  0.85163 -0.6130
## 4   X4  d  B -1.26626  0.4843
## 5   X5  e  B -0.62540  0.9086
## 6   X6  a  B  1.29213  0.0165
## 7   X7  b  C -0.07787  0.1515
## 8   X8  c  C -1.27464 -0.5080
## 9   X9  d  D -0.27981  1.7184
## 10 X10  e  D -0.95905 -0.4103

For insight into how the function creates sublists, let's group the data by value of V3.

df_by_V3_L <- dlply(df1, .(V3), function(df) df)
df_by_V3_L
## $A
##   V1 V2 V3      V4      V5
## 1 X1  a  A  0.3460 -0.9284
## 2 X2  b  A -2.0812  0.6062
## 3 X3  c  A  0.8516 -0.6130
## 
## $B
##   V1 V2 V3      V4     V5
## 1 X4  d  B -1.2663 0.4843
## 2 X5  e  B -0.6254 0.9086
## 3 X6  a  B  1.2921 0.0165
## 
## $C
##   V1 V2 V3       V4      V5
## 1 X7  b  C -0.07787  0.1515
## 2 X8  c  C -1.27464 -0.5080
## 
## $D
##    V1 V2 V3      V4      V5
## 1  X9  d  D -0.2798  1.7184
## 2 X10  e  D -0.9590 -0.4103
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   V3
## 1  A
## 2  B
## 3  C
## 4  D

Find the mean of V4 values by the V3 subgroups. This should be returnable as a vector; i.e., a 1-dimensional array.

mns_V4_by_V3 <- daply(df1, .(V3), function(df) mean(df$V4))
mns_V4_by_V3
##       A       B       C       D 
## -0.2945 -0.1998 -0.6763 -0.6194

Reports the group means. Now repeat this but find the means by combinations of V2 and V3 values.

daply(df1, .(V2, V3), function(df) mean(df$V4))
##    V3
## V2        A       B        C       D
##   a  0.3460  1.2921       NA      NA
##   b -2.0812      NA -0.07787      NA
##   c  0.8516      NA -1.27464      NA
##   d      NA -1.2663       NA -0.2798
##   e      NA -0.6254       NA -0.9590

Here, we report a 2-dimensional array; one for V2, one for V3. We can also output this as a data.frame, depending on what we want to do with it.

ddply(df1, .(V2, V3), function(df) mean(df$V4))
##    V2 V3       V1
## 1   a  A  0.34596
## 2   a  B  1.29213
## 3   b  A -2.08124
## 4   b  C -0.07787
## 5   c  A  0.85163
## 6   c  C -1.27464
## 7   d  B -1.26626
## 8   d  D -0.27981
## 9   e  B -0.62540
## 10  e  D -0.95905

We have a row for each nonempty combination of V2 and V3.

Working with array input

There are options for more than 2-dimensional arrays, but most uses are for matrices. Most commonly, the applications look a lot like uses of apply.

m_ex1 <- matrix(rnorm(20), nrow = 5, ncol = 4)
m_ex1
##         [,1]     [,2]    [,3]    [,4]
## [1,] -0.3251 -1.41294  1.6946  1.5352
## [2,] -0.4346  0.64337 -1.0930 -0.2257
## [3,]  0.3087 -0.08587 -1.5555 -0.4460
## [4,]  0.2503 -0.54574  0.0514 -0.7895
## [5,] -1.4014  0.21887 -0.5316 -1.1855
aaply(m_ex1, 1, sum)
##      1      2      3      4      5 
##  1.492 -1.110 -1.779 -1.034 -2.900

Conclusion

Most of the uses of plyr use a data.frame as input. It's usually wise to put the output into a data.frame, too, either by ddply or exporting an array that becomes a column or columns of a data.frame. This output can then be the input to further ddply applications, gplot plots, etc.