As the complexity of problems increases, so do the length and width of the data and the questions we want to ask. Rarely can this be accommodated smoothly with a single data.frame and pre-packaged functions. Yet, to function effectively as data analysts and scientists, we need to focus our attention on the problem and not get bogged down in coding. Having fast and conceptually natural methods for getting information from data is important.

Earlier we introduced the functions apply, lapply and sapply for simultaneously applying a function to the rows or columns of an array or to the components of a list. As one multi-step application, we coded a long, complex variable selection problem into a list of data.frames, then applied a significance test function to each list component. The resulting p-values were placed in a single vector and analyzed together. In short, the steps are: split the data into a list, apply a function to each component, and combine the results into a single object.
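Here is a minimal sketch of that pattern on made-up data (the data.frame toy_df and its variables are hypothetical, standing in for the actual variable selection problem):

# Hypothetical data: a measurement, a two-level group and a splitting variable
toy_df <- data.frame(value = rnorm(60),
                     group = rep(c("a", "b"), 30),
                     gene = rep(paste0("g", 1:6), each = 10))
# Step 1: split into a list of data.frames, one per gene
toy_L <- split(toy_df, toy_df$gene)
# Step 2: apply a significance test to each list component
pvals_L <- lapply(toy_L, function(df) t.test(value ~ group, data = df)$p.value)
# Step 3: combine the p-values into a single named vector
pvals <- unlist(pvals_L)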

The plyr library of functions enables us to handle many problems like this, involving arrays, data.frames and lists, in a uniform way. There are also ways to parallelize the process that take almost no effort (but a little thought).
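As a preview (ddply itself is covered below), every plyr function takes a .parallel argument; once a parallel backend is registered, the per-group computations are farmed out to worker cores. A minimal sketch, assuming the doParallel package is installed:

library(plyr)
library(doParallel)
registerDoParallel(cores = 2)  # register a backend with 2 worker cores
# Mean mpg by cylinder count on the built-in mtcars data, computed in parallel
ddply(mtcars, .(cyl), function(df) mean(df$mpg), .parallel = TRUE)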

Variable-defined splitting

One feature of the plyr family that we have yet to see is the ability to divide a data.frame into sub-frames based on the values of a variable and then apply a function to each sub-frame. In fact, this is one of the most common uses of plyr.
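The underlying operation is the one base R's split function performs: given a data.frame and a variable, it returns a list of sub-data.frames, one per value of the variable. A minimal sketch on the built-in mtcars data:

# One sub-data.frame per distinct value of cyl
by_cyl_L <- split(mtcars, mtcars$cyl)
# Mean mpg within each subgroup
sapply(by_cyl_L, function(df) mean(df$mpg))

We illustrate the plyr approach with a larger example.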

setwd("~/Documents/Computing with Data/15_split_apply")
flight_ontime_status_df <- read.csv("./Data/flight_ontime_status.csv")
str(flight_ontime_status_df)
## 'data.frame':    486133 obs. of  21 variables:
##  $ ORIGIN_AIRPORT_ID    : int  12478 12478 12478 12478 12478 12478 12478 12478 12478 12478 ...
##  $ ORIGIN_AIRPORT_SEQ_ID: int  1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 ...
##  $ ORIGIN_CITY_MARKET_ID: int  31703 31703 31703 31703 31703 31703 31703 31703 31703 31703 ...
##  $ ORIGIN_STATE_ABR     : Factor w/ 52 levels "AK","AL","AR",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ DEST_AIRPORT_ID      : int  12892 12892 12892 12892 12892 12892 12892 12892 12892 12892 ...
##  $ DEST_AIRPORT_SEQ_ID  : int  1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 ...
##  $ DEST_CITY_MARKET_ID  : int  32575 32575 32575 32575 32575 32575 32575 32575 32575 32575 ...
##  $ DEST_STATE_ABR       : Factor w/ 52 levels "AK","AL","AR",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ DEP_TIME             : int  855 921 931 904 858 911 902 855 858 852 ...
##  $ DEP_DELAY            : num  -5 21 31 4 -2 11 2 -5 -2 -8 ...
##  $ DEP_DELAY_NEW        : num  0 21 31 4 0 11 2 0 0 0 ...
##  $ ARR_TIME             : int  1142 1210 1224 1151 1142 1151 1203 1129 1127 1134 ...
##  $ ARR_DELAY            : num  -43 -15 -1 -34 -43 -34 -22 -56 -58 -51 ...
##  $ ARR_DELAY_NEW        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DISTANCE             : num  2475 2475 2475 2475 2475 ...
##  $ CARRIER_DELAY        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WEATHER_DELAY        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NAS_DELAY            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SECURITY_DELAY       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ LATE_AIRCRAFT_DELAY  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X                    : logi  NA NA NA NA NA NA ...

This data is at the level of individual flights. Exploratory analysis will call for breaking the data down by airport, state, time of day, etc. Getting at this information requires some reorganization of the data.

Problem: Compute summary statistics of all arrival delays for all airports with a reasonable number of flights

This is a little fuzzy, but that’s common. Let’s first find the number of flights per airport. Each row represents a flight, so the number of times a given airport code appears is the number of flights for that airport. We can get this information with the table function.

# Eliminate missing values in arrival delays
flight_ontime_status_df2 <- flight_ontime_status_df[!is.na(flight_ontime_status_df$ARR_DELAY_NEW),]
numb_flights_per_airport <- table(flight_ontime_status_df2$ORIGIN_AIRPORT_ID)
numb_flights_per_airport[1:5]
## 
## 10135 10136 10140 10146 10155 
##   245   198  2495    85     2
class(numb_flights_per_airport)
## [1] "table"
summary(as.numeric(numb_flights_per_airport))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     102     302    1670    1180   31100

Let’s exclude roughly the first quartile, the airports with 100 or fewer flights.

active_airports <- names(numb_flights_per_airport)[numb_flights_per_airport > 100]
# subset the flight data to this set of airports
flight_status_active_df <- subset(flight_ontime_status_df2, ORIGIN_AIRPORT_ID %in% active_airports)

Example airport

Getting, say, mean delay times for an individual airport is easy.

flight_status_active_df$ORIGIN_AIRPORT_ID[1:3]
## [1] 12478 12478 12478
# Take this as the example airport
mean(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW, na.rm=TRUE)
## [1] 8.475
# More informative are the summary statistics
summary(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     8.5     2.0   804.0

Looking at just one airport, it’s hard to tell whether this is a lot.

Step-by-step split-apply

How do we compute the summary statistics of delays for all airports?

First, create a list of data.frames, with each component corresponding to an airport and containing the flights originating from that airport. We use lapply.

arprt_flights_L <- lapply(active_airports, function(x) {
  subset(flight_status_active_df, ORIGIN_AIRPORT_ID == x)
  })
names(arprt_flights_L) <- active_airports
names(arprt_flights_L)[1:3]
## [1] "10135" "10136" "10140"
arprt_flights_L[["10135"]][1:3, 10:17]
##        DEP_DELAY DEP_DELAY_NEW ARR_TIME ARR_DELAY ARR_DELAY_NEW DISTANCE
## 147708        -9             0     1429       -12             0      425
## 147709        -8             0     1413       -28             0      425
## 147710        -7             0     1423       -18             0      425
##        CARRIER_DELAY WEATHER_DELAY
## 147708            NA            NA
## 147709            NA            NA
## 147710            NA            NA

Compute the mean of the variable ARR_DELAY_NEW for each data.frame.

means_by_airport_L <- lapply(arprt_flights_L, function(df) mean(df$ARR_DELAY_NEW, na.rm=T))
names(means_by_airport_L) <- active_airports
means_by_airport_L[1:3]
## $`10135`
## [1] 16.15
## 
## $`10136`
## [1] 6.485
## 
## $`10140`
## [1] 6.947

Create a nice vector of the output.

means_by_airport <- unlist(means_by_airport_L)
means_by_airport[1:3]
##  10135  10136  10140 
## 16.151  6.485  6.947
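Equivalently, sapply combines the last two steps, simplifying the list of results to a named vector:

means_by_airport <- sapply(arprt_flights_L, function(df) mean(df$ARR_DELAY_NEW, na.rm=TRUE))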

Statistics over airports with daply and ddply

Here is a compact way to do all of these steps.

Load the library.

library(plyr)

The daply function does all of these steps in one call, with no intermediate objects.

mean_delays_by_airport <- daply(flight_status_active_df, .(ORIGIN_AIRPORT_ID), 
                                function(df) mean(df$ARR_DELAY_NEW, na.rm=T))
mean_delays_by_airport[1:5]
##  10135  10136  10140  10157  10185 
## 16.151  6.485  6.947 24.058 11.180
class(mean_delays_by_airport)
## [1] "numeric"
length(mean_delays_by_airport)
## [1] 216
summary(mean_delays_by_airport)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.30    7.39    8.95    9.87   11.70   24.10

What if we want all of the summary statistics for each airport? We should get a matrix or data.frame of values.

summary_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID), 
                                   function(df) summary(df$ARR_DELAY_NEW))
head(summary_delays_by_airport)
##   ORIGIN_AIRPORT_ID Min. 1st Qu. Median  Mean 3rd Qu. Max.
## 1             10135    0       0      0 16.20     5.0  527
## 2             10136    0       0      0  6.48     0.0  219
## 3             10140    0       0      0  6.95     2.0  374
## 4             10157    0       0      0 24.10    19.0  288
## 5             10185    0       0      0 11.20     4.5  202
## 6             10208    0       0      0 20.40    13.0  317
# Get the 90th and 95th percentile
top_few <- function(x) {quantile(x, probs=c(0.9, 0.95))}
top_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID), 
                                   function(df) top_few(df$ARR_DELAY_NEW))
head(top_delays_by_airport)
##   ORIGIN_AIRPORT_ID  90%   95%
## 1             10135 29.0  78.8
## 2             10136 13.9  28.3
## 3             10140 20.0  38.0
## 4             10157 85.0 143.0
## 5             10185 35.0  71.1
## 6             10208 56.6 121.5

Survey of the plyr family of functions

daply, ddply or dlply

Input: a data.frame. The family dzply takes the form dzply(data, .(v1, v2, ...), function), where v1, v2, ... are variables. Here, the data.frame is subset by each combination of values of the variables to form a list of sub-data.frames. Then, the function is applied to each list component. Output: an array, data.frame or list, depending on z.

The value of z indicates what is done with the outputs of the function: return them as a list (l), stack them into a new data.frame (d), or form them into an array (a). Some options may be impossible, depending on the output of the function.
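The first letter encodes the input type in the same way. For example, with a list as input (a minimal sketch on a made-up list):

x_L <- list(a = 1:3, b = 4:6)
llply(x_L, mean)  # list in, list out
laply(x_L, mean)  # list in, array out (here a numeric vector)
ldply(x_L, mean)  # list in, data.frame out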

Working with data frame input

df1 <- data.frame(V1 = paste("X", 1:10, sep=""), V2=c(letters[1:5], letters[1:5]), 
                  V3=c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"), V4 = rnorm(10), V5=rnorm(10))
df1
##     V1 V2 V3       V4       V5
## 1   X1  a  A  0.78963  0.84490
## 2   X2  b  A -0.84284 -0.68019
## 3   X3  c  A -0.49802  0.16426
## 4   X4  d  B -0.33562  1.06401
## 5   X5  e  B -0.42798 -1.31098
## 6   X6  a  B -0.14044  0.54070
## 7   X7  b  C  0.01645 -1.45668
## 8   X8  c  C  0.41199 -2.14611
## 9   X9  d  D -0.76436 -0.02588
## 10 X10  e  D -1.81736  1.28319

For insight into how the function creates sublists, let’s group the data by value of V3.

df_by_V3_L <- dlply(df1, .(V3), function(df) df)
df_by_V3_L
## $A
##   V1 V2 V3      V4      V5
## 1 X1  a  A  0.7896  0.8449
## 2 X2  b  A -0.8428 -0.6802
## 3 X3  c  A -0.4980  0.1643
## 
## $B
##   V1 V2 V3      V4      V5
## 1 X4  d  B -0.3356  1.0640
## 2 X5  e  B -0.4280 -1.3110
## 3 X6  a  B -0.1404  0.5407
## 
## $C
##   V1 V2 V3      V4     V5
## 1 X7  b  C 0.01645 -1.457
## 2 X8  c  C 0.41199 -2.146
## 
## $D
##    V1 V2 V3      V4       V5
## 1  X9  d  D -0.7644 -0.02588
## 2 X10  e  D -1.8174  1.28319
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   V3
## 1  A
## 2  B
## 3  C
## 4  D

Find the mean of V4 values by the V3 subgroups. This should be returnable as a vector; i.e., a 1-dimensional array.

mns_V4_by_V3 <- daply(df1, .(V3), function(df) mean(df$V4))
mns_V4_by_V3
##       A       B       C       D 
## -0.1837 -0.3013  0.2142 -1.2909

This reports the group means. Now repeat this, but find the means by combinations of V2 and V3 values.

daply(df1, .(V2,V3), function(df) mean(df$V4))
##    V3
## V2        A       B       C       D
##   a  0.7896 -0.1404      NA      NA
##   b -0.8428      NA 0.01645      NA
##   c -0.4980      NA 0.41199      NA
##   d      NA -0.3356      NA -0.7644
##   e      NA -0.4280      NA -1.8174

Here, we get a 2-dimensional array, with one dimension for V2 and one for V3. We can also output this as a data.frame, depending on what we want to do with it.

ddply(df1, .(V2,V3), function(df) mean(df$V4))
##    V2 V3       V1
## 1   a  A  0.78963
## 2   a  B -0.14044
## 3   b  A -0.84284
## 4   b  C  0.01645
## 5   c  A -0.49802
## 6   c  C  0.41199
## 7   d  B -0.33562
## 8   d  D -0.76436
## 9   e  B -0.42798
## 10  e  D -1.81736

We have a row for each nonempty combination of V2 and V3.
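Note that ddply labeled the output column V1 because our anonymous function returned an unnamed value. Returning a named one-row data.frame, or using plyr's summarise helper, produces a more readable column name:

# Name the output column explicitly
ddply(df1, .(V2, V3), function(df) data.frame(mean_V4 = mean(df$V4)))
# The same, with plyr's summarise helper
ddply(df1, .(V2, V3), summarise, mean_V4 = mean(V4))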

Exploring data with ggplot and plyr

Consider the diamonds dataset in ggplot2.

library(ggplot2)
str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
ppc <- ggplot(data = diamonds, aes(x = color, y= price/carat))
ppc + geom_jitter(alpha = 0.1)

[Figure: jittered scatterplot of price/carat by diamond color]

It looks like most samples in each color have similar price/carat values, but there are longer tails in some colors. Let’s efficiently look at the numbers.

#Add a variable for price/carat
diamonds2 <- transform(diamonds, price.p.carat = price/carat)
#Compute some summary statistics per color.
ppc_by_color_df <- ddply(diamonds2, .(color), function(df) summary(df$price.p.carat))
ppc_by_color_df
##   color Min. 1st Qu. Median Mean 3rd Qu.  Max.
## 1     D 1130    2460   3410 3950    4750 17800
## 2     E 1080    2430   3250 3800    4510 14600
## 3     F 1170    2590   3490 4130    4950 13900
## 4     G 1140    2540   3490 4160    5500 12500
## 5     H 1050    2400   3820 4010    5130 10200
## 6     I 1150    2340   3780 4000    5200  9400
## 7     J 1080    2560   3780 3830    4930  8650