As the complexity of problems increases, so does the length and width of the data and the questions we want to ask. Rarely, can this be accommodated smoothly with a singl;e data.frame and thepre-packaged functions. Yet, in order to function effectively as data analysts and scientists we need to focus our attention on the problem and not get bogged down in coding. Having the fastest and most conceptually natural methods for getting information from data is important.
Earlier we introduced the functions apply
, lapply
and sapply
for simultaneously applying a function to the rows or columns of an array or the components of a list. As one multi-step application, we coded a long complex variable selection problem into a list of data.frames, then applied a significance test function to each list component. The resulting p-values were placed in a single vector and analyzed together. In short, the steps are:
The plyr
library of functions enables us to do many problems like this involving arrays, data.frames and lists in a uniform way. There are also ways to parallelize the process and take almost no effort (but a little thought).
One feature of the plyr
family that we have yet to see is the ability to divide a data.frame into sub-frames based on the value of a variable and then apply a function to the subframes. In fact, this is one of the most common uses of plyr
. We illustrate with an example.
setwd("~/Documents/Computing with Data/15_split_apply")
flight_ontime_status_df <- read.csv("./Data/flight_ontime_status.csv")
str(flight_ontime_status_df)
## 'data.frame': 486133 obs. of 21 variables:
## $ ORIGIN_AIRPORT_ID : int 12478 12478 12478 12478 12478 12478 12478 12478 12478 12478 ...
## $ ORIGIN_AIRPORT_SEQ_ID: int 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 1247802 ...
## $ ORIGIN_CITY_MARKET_ID: int 31703 31703 31703 31703 31703 31703 31703 31703 31703 31703 ...
## $ ORIGIN_STATE_ABR : Factor w/ 52 levels "AK","AL","AR",..: 33 33 33 33 33 33 33 33 33 33 ...
## $ DEST_AIRPORT_ID : int 12892 12892 12892 12892 12892 12892 12892 12892 12892 12892 ...
## $ DEST_AIRPORT_SEQ_ID : int 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 1289203 ...
## $ DEST_CITY_MARKET_ID : int 32575 32575 32575 32575 32575 32575 32575 32575 32575 32575 ...
## $ DEST_STATE_ABR : Factor w/ 52 levels "AK","AL","AR",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ DEP_TIME : int 855 921 931 904 858 911 902 855 858 852 ...
## $ DEP_DELAY : num -5 21 31 4 -2 11 2 -5 -2 -8 ...
## $ DEP_DELAY_NEW : num 0 21 31 4 0 11 2 0 0 0 ...
## $ ARR_TIME : int 1142 1210 1224 1151 1142 1151 1203 1129 1127 1134 ...
## $ ARR_DELAY : num -43 -15 -1 -34 -43 -34 -22 -56 -58 -51 ...
## $ ARR_DELAY_NEW : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DISTANCE : num 2475 2475 2475 2475 2475 ...
## $ CARRIER_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ WEATHER_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ NAS_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ SECURITY_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ LATE_AIRCRAFT_DELAY : num NA NA NA NA NA NA NA NA NA NA ...
## $ X : logi NA NA NA NA NA NA ...
This data is at the level of individual flights. Exploratory analysis of the data will call for analysis of the data broken down by airport or state or time of day, etc. Getting at this information requires some reorganization of the data.
This is a little fuzzy, but that's common. Let's first find the number of flights per airport. Each row represents a flight, so the number of times a given airport code appears is the number of flights for that airport. We can get this information with the table
function.
# Eliminate missing values in arrival delays
flight_ontime_status_df2 <- flight_ontime_status_df[!is.na(flight_ontime_status_df$ARR_DELAY_NEW),
]
numb_flights_per_airport <- table(flight_ontime_status_df2$ORIGIN_AIRPORT_ID)
numb_flights_per_airport[1:5]
##
## 10135 10136 10140 10146 10155
## 245 198 2495 85 2
class(numb_flights_per_airport)
## [1] "table"
summary(as.numeric(numb_flights_per_airport))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 102 302 1670 1180 31100
Let's exclude the first quartile with only around 100 flights.
active_airports <- names(numb_flights_per_airport)[numb_flights_per_airport >
100]
# subset the flight data to this set of airports
flight_status_active_df <- subset(flight_ontime_status_df2, ORIGIN_AIRPORT_ID %in%
active_airports)
Getting, say, mean delay times for an individual airport is easy.
flight_status_active_df$ORIGIN_AIRPORT_ID[1:3]
## [1] 12478 12478 12478
# Take this as the example airport
mean(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW,
na.rm = TRUE)
## [1] 8.475
# More informative is the summary statistics
summary(subset(flight_status_active_df, ORIGIN_AIRPORT_ID == 12478)$ARR_DELAY_NEW)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 8.5 2.0 804.0
Looking at just one airport it's hard to tell if this is a lot.
daply
and ddply
Load the library.
library(plyr)
mean_delays_by_airport <- daply(flight_status_active_df, .(ORIGIN_AIRPORT_ID),
function(df) mean(df$ARR_DELAY_NEW, na.rm = T))
mean_delays_by_airport[1:5]
## 10135 10136 10140 10157 10185
## 16.151 6.485 6.947 24.058 11.180
class(mean_delays_by_airport)
## [1] "numeric"
length(mean_delays_by_airport)
## [1] 216
summary(mean_delays_by_airport)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.30 7.39 8.95 9.87 11.70 24.10
What if we want all summary statistics for each one. We should get a matrix or data.frame of values.
summary_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID),
function(df) summary(df$ARR_DELAY_NEW))
head(summary_delays_by_airport)
## ORIGIN_AIRPORT_ID Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 10135 0 0 0 16.20 5.0 527
## 2 10136 0 0 0 6.48 0.0 219
## 3 10140 0 0 0 6.95 2.0 374
## 4 10157 0 0 0 24.10 19.0 288
## 5 10185 0 0 0 11.20 4.5 202
## 6 10208 0 0 0 20.40 13.0 317
# Get the 90th and 95th percentile
top_few <- function(x) {
quantile(x, probs = c(0.9, 0.95))
}
top_delays_by_airport <- ddply(flight_status_active_df, .(ORIGIN_AIRPORT_ID),
function(df) top_few(df$ARR_DELAY_NEW))
head(top_delays_by_airport)
## ORIGIN_AIRPORT_ID 90% 95%
## 1 10135 29.0 78.8
## 2 10136 13.9 28.3
## 3 10140 20.0 38.0
## 4 10157 85.0 143.0
## 5 10185 35.0 71.1
## 6 10208 56.6 121.5
plyr
family of functionsplyr
provides 6 functions of the form
xzply
where x and z can be a, d or l. These letters are short for array, data.frame and list, respectively. The first slot, held by x, refers to the type of input, z refers to the format of the output. The format of how they work is as follows:
The value of z indicates what you do with the output of the functions; i.e., return a list of outputs, form into a new data.frame by stacking the output, or form into an array. Some options may be impossible, depending on the output of the function.
The function llply
is largely similar to lapply
. The function sapply
can be replaced by laply
, but the latter is more general.
la_ex1 <- list(a = 1:3, b = 8:20, c = rep(5, times = 10))
la_ex1
## $a
## [1] 1 2 3
##
## $b
## [1] 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## $c
## [1] 5 5 5 5 5 5 5 5 5 5
laply(la_ex1, function(x) c(min(x), max(x)))
## 1 2
## [1,] 1 3
## [2,] 8 20
## [3,] 5 5
The output can be formed into a multidimensional array
The ldply
function is appropriate if we want to caputure more information or the output yields a mixture of character and numeric data.
ldply(la_ex1, function(x) c(min(x), max(x)))
## .id V1 V2
## 1 a 1 3
## 2 b 8 20
## 3 c 5 5
The data.frame output format includes a column of names of the list components, which can be useful.
df1 <- data.frame(V1 = paste("X", 1:10, sep = ""), V2 = c(letters[1:5], letters[1:5]),
V3 = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"), V4 = rnorm(10),
V5 = rnorm(10))
df1
## V1 V2 V3 V4 V5
## 1 X1 a A 0.34596 -0.9284
## 2 X2 b A -2.08124 0.6062
## 3 X3 c A 0.85163 -0.6130
## 4 X4 d B -1.26626 0.4843
## 5 X5 e B -0.62540 0.9086
## 6 X6 a B 1.29213 0.0165
## 7 X7 b C -0.07787 0.1515
## 8 X8 c C -1.27464 -0.5080
## 9 X9 d D -0.27981 1.7184
## 10 X10 e D -0.95905 -0.4103
For insight into how the function creates sublists, let's group the data by value of V3.
df_by_V3_L <- dlply(df1, .(V3), function(df) df)
df_by_V3_L
## $A
## V1 V2 V3 V4 V5
## 1 X1 a A 0.3460 -0.9284
## 2 X2 b A -2.0812 0.6062
## 3 X3 c A 0.8516 -0.6130
##
## $B
## V1 V2 V3 V4 V5
## 1 X4 d B -1.2663 0.4843
## 2 X5 e B -0.6254 0.9086
## 3 X6 a B 1.2921 0.0165
##
## $C
## V1 V2 V3 V4 V5
## 1 X7 b C -0.07787 0.1515
## 2 X8 c C -1.27464 -0.5080
##
## $D
## V1 V2 V3 V4 V5
## 1 X9 d D -0.2798 1.7184
## 2 X10 e D -0.9590 -0.4103
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## V3
## 1 A
## 2 B
## 3 C
## 4 D
Find the mean of V4 values by the V3 subgroups. This should be returnable as a vector; i.e., a 1-dimensional array.
mns_V4_by_V3 <- daply(df1, .(V3), function(df) mean(df$V4))
mns_V4_by_V3
## A B C D
## -0.2945 -0.1998 -0.6763 -0.6194
Reports the group means. Now repeat this but find the means by combinations of V2 and V3 values.
daply(df1, .(V2, V3), function(df) mean(df$V4))
## V3
## V2 A B C D
## a 0.3460 1.2921 NA NA
## b -2.0812 NA -0.07787 NA
## c 0.8516 NA -1.27464 NA
## d NA -1.2663 NA -0.2798
## e NA -0.6254 NA -0.9590
Here, we report a 2-dimensional array; one for V2, one for V3. We can also output this as a data.frame, depending on what we want to do with it.
ddply(df1, .(V2, V3), function(df) mean(df$V4))
## V2 V3 V1
## 1 a A 0.34596
## 2 a B 1.29213
## 3 b A -2.08124
## 4 b C -0.07787
## 5 c A 0.85163
## 6 c C -1.27464
## 7 d B -1.26626
## 8 d D -0.27981
## 9 e B -0.62540
## 10 e D -0.95905
We have a row for each nonempty combination of V2 and V3.
There are options for more than 2-dimensional arrays, but most uses are for matrices. Most commonly, the applications look a lot like uses of apply
.
m_ex1 <- matrix(rnorm(20), nrow = 5, ncol = 4)
m_ex1
## [,1] [,2] [,3] [,4]
## [1,] -0.3251 -1.41294 1.6946 1.5352
## [2,] -0.4346 0.64337 -1.0930 -0.2257
## [3,] 0.3087 -0.08587 -1.5555 -0.4460
## [4,] 0.2503 -0.54574 0.0514 -0.7895
## [5,] -1.4014 0.21887 -0.5316 -1.1855
aaply(m_ex1, 1, sum)
## 1 2 3 4 5
## 1.492 -1.110 -1.779 -1.034 -2.900
Most of the uses of plyr
use a data.frame as input. It's usually wise to put the output into a data.frame, too, either by ddply
or exporting an array that becomes a column or columns of a data.frame. This output can then be the input to further ddply
applications, gplot
plots, etc.