As the complexity of problems increases, so do the size of the data and the range of questions we want to ask. Rarely can this be accommodated smoothly with a single data.frame and pre-packaged functions. Yet, to function effectively as data analysts and scientists, we need to focus our attention on the problem and not get bogged down in coding. Having fast and conceptually natural methods for getting information from data is therefore important.
Earlier we introduced the functions apply, lapply and sapply for simultaneously applying a function to the rows or columns of an array, or to the components of a list. As one multi-step application, we coded a long, complex variable selection problem into a list of data.frames, then applied a significance test function to each list component. The resulting p-values were placed in a single vector and analyzed together. In short, the steps are:

1. Split the data into a list of data.frames.
2. Apply a function to each list component.
3. Combine the results into a single vector.
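A minimal base-R sketch of this split-apply-combine pattern (the data here is illustrative, not the variable selection problem described above):

```r
# Illustrative data: two groups with known values.
df <- data.frame(group = rep(c("a", "b"), each = 5), value = 1:10)

# 1. Split the data.frame into a list of sub-data.frames, one per group.
sub_dfs <- split(df, df$group)

# 2. Apply a summary function to each list component.
results_L <- lapply(sub_dfs, function(d) mean(d$value))

# 3. Combine the results into a single named vector.
results <- unlist(results_L)
results
## a b
## 3 8
```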
The plyr library of functions lets us handle many problems like this, involving arrays, data.frames and lists, in a uniform way. There are also ways to parallelize the process that take almost no effort (but a little thought).
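As a sketch of the parallel option: the plyr functions accept a .parallel argument, which hands the per-subframe computations to whatever foreach backend has been registered. This sketch assumes the doParallel package is installed; the dataset (iris) is purely illustrative.

```r
library(plyr)
library(doParallel)  # assumed installed; supplies a foreach backend

# Register a small parallel backend; plyr uses it when .parallel = TRUE.
registerDoParallel(cores = 2)

# The same call as a serial ddply, plus .parallel = TRUE.
res <- ddply(iris, .(Species), function(df) mean(df$Sepal.Length),
             .parallel = TRUE)
res
##      Species    V1
## 1     setosa 5.006
## 2 versicolor 5.936
## 3  virginica 6.588
```

For small data like this the parallel overhead outweighs the gain; the option pays off when each sub-frame computation is expensive.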
One feature of the plyr family that we have yet to see is the ability to divide a data.frame into sub-frames based on the values of a variable and then apply a function to the sub-frames. In fact, this is one of the most common uses of plyr. We illustrate with an example.
setwd("~/Documents/Computing with Data/15_split_apply")
flight_ontime_status_df <- read.csv("./Data/flight_ontime_status.csv")
str(flight_ontime_status_df)
## 'data.frame': 486133 obs. of 21 variables:
## $ ORIGIN_AIRPORT_ID : int 12478 12478 12478 12478 12478 12478 12478 12478 12478 12478 ...
## $ ORIGIN_AIRPORT_SEQ_ID: int 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 ...
## $ ORIGIN_CITY_MARKET_ID: int 31703 31703 31703 31703 31703 31703 31703 31703 31703 31703 ...
## $ ORIGIN_STATE_ABR : Factor w/ 52 levels "AK","AL","AR",..: 33 33 33 33 33 33 33 33 33 33 ...
## $ DEST_AIRPORT_ID : int 12892 12892 12892 12892 12892 12892 12892 12892 12892 12892 ...
## $ DEST_AIRPORT_SEQ_ID : int 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 ...
## $ DEST_CITY_MARKET_ID : int 32575 32575 32575 32575 32575 32575 32575 32575 32575 32575 ...
## $ DEST_STATE_ABR : Factor w/ 52 levels "AK","AL","AR",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ DEP_TIME : int 855 921 931 904 858 911 902 855 858 852 ...
## $ DEP_DELAY : num -5 21 31 4 -2 11 2 -5 -2 -8 ...
## $ DEP_DELAY_NEW : num 0 21 31 4 0 11 2 0 0 0 ...
## $ ARR_TIME : int 1142 1210 1224 1151 1142 1151 1203 1129 1127 1134 ...
## $ ARR_DELAY : num -43 -15 -1 -34 -43 -34 -22 -56 -58 -51 ...
## $ ARR_DELAY_NEW : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DISTANCE : num 2475 2475 2475 2475 2475 ...
## $ CARRIER_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ WEATHER_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ NAS_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ SECURITY_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ LATE_AIRCRAFT_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ X : logi NA NA NA NA NA NA ...
This data is at the level of individual flights. Exploratory analysis will call for breaking the data down by airport, state, time of day, etc. Getting at this information requires some reorganization of the data.
The goal is a little fuzzy at this point, but that’s common. Let’s first find the number of flights per airport. Each row represents a flight, so the number of times a given airport code appears is the number of flights from that airport. We can get this information with the table function.
# Eliminate missing values in arrival delays
flight_ontime_status_df2 <- flight_ontime_status_df[!is.na(flight_ontime_status_df$ARR_DELAY_NEW),]
numb_flights_per_airport <- table(flight_ontime_status_df2$ORIGIN_AIRPORT_ID)
numb_flights_per_airport[1:5]
##
## 10135 10136 10140 10146 10155
## 245 198 2495 85 2
class(numb_flights_per_airport)
## [1] "table"
summary(as.numeric(numb_flights_per_airport))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 102 302 1670 1180 31100
Let’s exclude the airports in the first quartile, those with only around 100 flights or fewer.
active_airports <- names(numb_flights_per_airport)[numb_flights_per_airport > 100]
# subset the flight data to this set of airports
flight_status_active_df <- subset(flight_ontime_status_df2, ORIGIN_AIRPORT_ID %in% active_airports)
Getting, say, mean delay times for an individual airport is easy.
flight_status_active_df$ORIGIN_AIRPORT_ID[1:3]
## [1] 12478 12478 12478
# Take this as the example airport
mean(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW, na.rm=TRUE)
## [1] 8.475
# The summary statistics are more informative
summary(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 8.5 2.0 804.0
Looking at just one airport, it’s hard to tell whether this is a lot.
How do we compute the summary statistics of delays for all airports?
First, create a list of data.frames, each component corresponding to an airport and each data.frame containing the flights originating from that airport. We use lapply.
arprt_flights_L <- lapply(active_airports, function(x) {
subset(flight_status_active_df, ORIGIN_AIRPORT_ID == x)
})
names(arprt_flights_L) <- active_airports
names(arprt_flights_L)[1:3]
## [1] "10135" "10136" "10140"
arprt_flights_L[["10135"]][1:3, 10:17]
## DEP_DELAY DEP_DELAY_NEW ARR_TIME ARR_DELAY ARR_DELAY_NEW DISTANCE
## 147708 -9 0 1429 -12 0 425
## 147709 -8 0 1413 -28 0 425
## 147710 -7 0 1423 -18 0 425
## CARRIER_DELAY WEATHER_DELAY
## 147708 NA NA
## 147709 NA NA
## 147710 NA NA
Compute the mean of the variable ARR_DELAY_NEW for each data.frame.
means_by_airport_L <- lapply(arprt_flights_L, function(df) mean(df$ARR_DELAY_NEW, na.rm=T))
names(means_by_airport_L) <- active_airports
means_by_airport_L[1:3]
## $`10135`
## [1] 16.15
##
## $`10136`
## [1] 6.485
##
## $`10140`
## [1] 6.947
Finally, combine the output into a single named vector.
means_by_airport <- unlist(means_by_airport_L)
means_by_airport[1:3]
## 10135 10136 10140
## 16.151 6.485 6.947
daply and ddply
Here is a compact way to do all of these steps.
Load the library.
library(plyr)
The daply function does all of the steps in one call, with no intermediate objects.
mean_delays_by_airport <- daply(flight_status_active_df, .(ORIGIN_AIRPORT_ID),
function(df) mean(df$ARR_DELAY_NEW, na.rm=T))
mean_delays_by_airport[1:5]
## 10135 10136 10140 10157 10185
## 16.151 6.485 6.947 24.058 11.180
class(mean_delays_by_airport)
## [1] "numeric"
length(mean_delays_by_airport)
## [1] 216
summary(mean_delays_by_airport)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.30 7.39 8.95 9.87 11.70 24.10
What if we want all of the summary statistics for each airport? We should get a matrix or data.frame of values.
summary_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID),
function(df) summary(df$ARR_DELAY_NEW))
head(summary_delays_by_airport)
## ORIGIN_AIRPORT_ID Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 10135 0 0 0 16.20 5.0 527
## 2 10136 0 0 0 6.48 0.0 219
## 3 10140 0 0 0 6.95 2.0 374
## 4 10157 0 0 0 24.10 19.0 288
## 5 10185 0 0 0 11.20 4.5 202
## 6 10208 0 0 0 20.40 13.0 317
# Get the 90th and 95th percentile
top_few <- function(x) {quantile(x, probs=c(0.9, 0.95))}
top_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID),
function(df) top_few(df$ARR_DELAY_NEW))
head(top_delays_by_airport)
## ORIGIN_AIRPORT_ID 90% 95%
## 1 10135 29.0 78.8
## 2 10136 13.9 28.3
## 3 10140 20.0 38.0
## 4 10157 85.0 143.0
## 5 10185 35.0 71.1
## 6 10208 56.6 121.5
The dzply members of the plyr family — daply, ddply and dlply — take a data.frame as input. Each call has the form dzply(data, .(v1, v2, ...), function), where v1, v2, ... are variables of the data.frame. The data.frame is subset by each combination of values of these variables to form a list of sub-data.frames, and the function is then applied to each list component.

The value of z indicates what is done with the outputs of the function: return them as a list (dlply), stack them into a new data.frame (ddply), or form them into an array (daply). Some options may be impossible, depending on the output of the function.
df1 <- data.frame(V1 = paste("X", 1:10, sep=""), V2=c(letters[1:5], letters[1:5]),
V3=c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"), V4 = rnorm(10), V5=rnorm(10))
df1
## V1 V2 V3 V4 V5
## 1 X1 a A 0.78963 0.84490
## 2 X2 b A -0.84284 -0.68019
## 3 X3 c A -0.49802 0.16426
## 4 X4 d B -0.33562 1.06401
## 5 X5 e B -0.42798 -1.31098
## 6 X6 a B -0.14044 0.54070
## 7 X7 b C 0.01645 -1.45668
## 8 X8 c C 0.41199 -2.14611
## 9 X9 d D -0.76436 -0.02588
## 10 X10 e D -1.81736 1.28319
For insight into how the function forms the sub-frames, let’s group the data by the value of V3.
df_by_V3_L <- dlply(df1, .(V3), function(df) df)
df_by_V3_L
## $A
## V1 V2 V3 V4 V5
## 1 X1 a A 0.7896 0.8449
## 2 X2 b A -0.8428 -0.6802
## 3 X3 c A -0.4980 0.1643
##
## $B
## V1 V2 V3 V4 V5
## 1 X4 d B -0.3356 1.0640
## 2 X5 e B -0.4280 -1.3110
## 3 X6 a B -0.1404 0.5407
##
## $C
## V1 V2 V3 V4 V5
## 1 X7 b C 0.01645 -1.457
## 2 X8 c C 0.41199 -2.146
##
## $D
## V1 V2 V3 V4 V5
## 1 X9 d D -0.7644 -0.02588
## 2 X10 e D -1.8174 1.28319
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## V3
## 1 A
## 2 B
## 3 C
## 4 D
Find the mean of V4 values by the V3 subgroups. This should be returnable as a vector; i.e., a 1-dimensional array.
mns_V4_by_V3 <- daply(df1, .(V3), function(df) mean(df$V4))
mns_V4_by_V3
## A B C D
## -0.1837 -0.3013 0.2142 -1.2909
This reports the group means. Now repeat, but find the means by combinations of V2 and V3 values.
daply(df1, .(V2,V3), function(df) mean(df$V4))
## V3
## V2 A B C D
## a 0.7896 -0.1404 NA NA
## b -0.8428 NA 0.01645 NA
## c -0.4980 NA 0.41199 NA
## d NA -0.3356 NA -0.7644
## e NA -0.4280 NA -1.8174
Here, we get a 2-dimensional array, with one dimension for V2 and one for V3. We can also output this as a data.frame, depending on what we want to do with it.
ddply(df1, .(V2,V3), function(df) mean(df$V4))
## V2 V3 V1
## 1 a A 0.78963
## 2 a B -0.14044
## 3 b A -0.84284
## 4 b C 0.01645
## 5 c A -0.49802
## 6 c C 0.41199
## 7 d B -0.33562
## 8 d D -0.76436
## 9 e B -0.42798
## 10 e D -1.81736
We have a row for each nonempty combination of V2 and V3.
Consider the diamonds dataset in ggplot2.
library(ggplot2)
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
ppc <- ggplot(data = diamonds, aes(x = color, y= price/carat))
ppc + geom_jitter(alpha = 0.1)
It looks like most samples in each color have similar price/carat values, but there are longer tails in some colors. Let’s efficiently look at the numbers.
#Add a variable for price/carat
diamonds2 <- transform(diamonds, price.p.carat = price/carat)
#Compute some summary statistics per color.
ppc_by_color_df <- ddply(diamonds2, .(color), function(df) summary(df$price.p.carat))
ppc_by_color_df
## color Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 D 1130 2460 3410 3950 4750 17800
## 2 E 1080 2430 3250 3800 4510 14600
## 3 F 1170 2590 3490 4130 4950 13900
## 4 G 1140 2540 3490 4160 5500 12500
## 5 H 1050 2400 3820 4010 5130 10200
## 6 I 1150 2340 3780 4000 5200 9400
## 7 J 1080 2560 3780 3830 4930 8650