# Practice with plyr, ggplot2

February 20, 2014

setwd("/Users/steve/Documents/Computing with Data/18_plyr_practice/")
library(ggplot2)
library(plyr)
library(hflights)


### Preview the flight data

names(hflights)

 [1] "Year"              "Month"             "DayofMonth"
 [4] "DayOfWeek"         "DepTime"           "ArrTime"
 [7] "UniqueCarrier"     "FlightNum"         "TailNum"
[10] "ActualElapsedTime" "AirTime"           "ArrDelay"
[13] "DepDelay"          "Origin"            "Dest"
[16] "Distance"          "TaxiIn"            "TaxiOut"
[19] "Cancelled"         "CancellationCode"  "Diverted"


You should also use str(hflights), which shows the type and first few values of each variable.

## How does the number of flights vary by day?

### Problem setup and discussion

• How do we compute the number of flights on a given day?
• What features of a day can affect the number of flights?
• We are restricting the study to properties of the date

• What data features pin down the day?
• Add a new variable uniquely determining the day

hflights2 <- transform(hflights, date = paste(Month, "-", DayofMonth, sep=""))


There is another way: order the days and add an indicator for the day of the year. It takes a little more typing.
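
The day-of-year alternative mentioned above could be sketched as follows, using base R only. The toy data frame `toy` and the column name `day_of_year` are illustrative and not part of the original; the sketch assumes the data carry `Year`, `Month` and `DayofMonth` columns, as hflights does.

```r
# Toy stand-in for hflights with only the date columns (hypothetical values).
toy <- data.frame(
  Year       = c(2011, 2011, 2011, 2011),
  Month      = c(1, 1, 2, 12),
  DayofMonth = c(1, 10, 1, 31)
)

# Build a proper Date from the three columns, then extract the day of the
# year with the "%j" format code and convert it to an integer.
toy <- transform(
  toy,
  day_of_year = as.integer(format(
    as.Date(paste(Year, Month, DayofMonth, sep = "-")), "%j"
  ))
)

toy$day_of_year
```

Because `day_of_year` is an integer, it sorts chronologically, whereas the strings "1-1", "1-10", "1-11" sort alphabetically.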

### Compute the number of flights for each day

• What will the result look like?
• What represents a single flight?
• For January 1, for example, how many flights were there?

### Flights on January 1

jan1_flights <- subset(hflights2, date == "1-1")
nrow(jan1_flights)

[1] 552


### Flights for every day

• What should the output look like?
• What tool will uniformly break up the flight data by date?
• We can use ddply or daply

### Flights for every day: solution

To produce a data.frame:

flights_by_day_df <- ddply(hflights2, .(date), function(df) nrow(df))

  date  V1
1  1-1 552
2 1-10 659
3 1-11 583

names(flights_by_day_df)[2] <- "FlightsByDay"


### Flights for every day: solution

To produce a named vector (strictly, a one-dimensional array):

flights_by_day_vector <- daply(hflights2, .(date), function(df) nrow(df))
flights_by_day_vector[1:3]

 1-1 1-10 1-11
 552  659  583
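
As a cross-check on either solution, the same per-date counts can be obtained in base R with `table()`. This is a sketch on toy data, not the approach used in these slides; the data frame `toy` and the names `counts_vector` and `counts_df` are hypothetical.

```r
# Toy data: one row per flight, keyed by a "Month-Day" string (hypothetical values).
toy <- data.frame(date = c("1-1", "1-1", "1-10", "1-1", "1-10", "1-11"),
                  stringsAsFactors = FALSE)

# table() counts the rows for each distinct date.
counts <- table(toy$date)

# As a named vector, comparable to the daply() result.
counts_vector <- setNames(as.vector(counts), names(counts))

# As a data frame, comparable to the ddply() result.
counts_df <- data.frame(date = names(counts),
                        FlightsByDay = as.vector(counts),
                        stringsAsFactors = FALSE)
```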


### Where do we go from here?

• What features may affect this number?
• Where is the data we need to answer these questions?
• What variables are characteristics of the date?
• How do we add the flights by day results to the rest of the data?

### Restrict to date-specific information

The variables are: Month, day of month, day of week, date

date_properties_df <- subset(hflights2, select=c(Month, DayofMonth, DayOfWeek, date))


How many rows here?

nrow(date_properties_df)

[1] 227496

date_properties_df <- unique(date_properties_df)
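
The effect of `unique()` here can be seen on a small example. The data frame `toy` below is a hypothetical stand-in for the flight-level table, with several flights sharing a date.

```r
# Toy flight-level data: two flights fall on the same date (hypothetical values).
toy <- data.frame(
  Month      = c(1, 1, 1, 2),
  DayofMonth = c(1, 1, 2, 1),
  DayOfWeek  = c(6, 6, 7, 2),
  date       = c("1-1", "1-1", "1-2", "2-1"),
  stringsAsFactors = FALSE
)

# unique() on a data frame drops rows that exactly repeat an earlier row.
toy_dates <- unique(toy)

nrow(toy_dates)                # one row per distinct date
anyDuplicated(toy_dates$date)  # 0 means no date appears twice
```

Running `nrow(date_properties_df)` again after the `unique()` step is a quick way to confirm that the table now has one row per day rather than one row per flight.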


### Join all the data

When flights by day is a data.frame

date_properties_df2 <- merge(date_properties_df, flights_by_day_df)


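When flights by day is a vector:

date_properties_df2_2 <- transform(date_properties_df, FlightsByDay = flights_by_day_vector[as.character(date)])

Wrapping date in as.character() makes the lookup go by name even when date is stored as a factor.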
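
The two ways of joining can be compared on toy data. This is a sketch only: `dates_df`, `counts_df` and `counts_vector` are hypothetical stand-ins for `date_properties_df`, `flights_by_day_df` and `flights_by_day_vector`, and the values are made up for illustration.

```r
# Toy per-date properties and per-date counts (hypothetical values).
dates_df <- data.frame(date = c("1-1", "1-10", "1-2"),
                       DayOfWeek = c(6, 1, 7),
                       stringsAsFactors = FALSE)
counts_df <- data.frame(date = c("1-1", "1-10", "1-2"),
                        FlightsByDay = c(552, 659, 600),
                        stringsAsFactors = FALSE)
counts_vector <- setNames(counts_df$FlightsByDay, counts_df$date)

# Route 1: merge() joins on the column name the two tables share, here "date".
joined_merge <- merge(dates_df, counts_df)

# Route 2: look up each date by name in the named vector.
# as.character() guards against `date` being a factor, in which case plain
# indexing would use the integer codes rather than the labels.
joined_lookup <- transform(
  dates_df,
  FlightsByDay = unname(counts_vector[as.character(date)])
)
```

Both routes attach the same count to each date; `merge()` may reorder the rows by the join key, while the lookup keeps the row order of the left-hand table.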


### What is the mean number of flights per day, by month?

This is a summary statistic by month.

flights_by_month <- daply(date_properties_df2, .(Month), function(df) mean(df$FlightsByDay))
flights_by_month[1:6]

    1     2     3     4     5     6
610.0 611.7 628.1 619.8 618.5 653.3

flights_by_month[7:12]

    7     8     9    10    11    12
662.8 650.8 602.2 603.1 600.7 616.7


There is a definite up-tick in the summer.

### What are the mean flights by day of the week?

flights_by_day_week <- daply(date_properties_df2, .(DayOfWeek), function(df) mean(df$FlightsByDay))
flights_by_day_week

    1     2     3     4     5     6     7
660.8 608.6 614.0 671.2 672.5 521.3 616.5


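Monday, Thursday and Friday (days 1, 4 and 5) have the most flights, about 660 to 670 per day on average, while Saturday (day 6) has the fewest, about 520.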
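
The same kind of group-wise mean can be computed in base R with `tapply()`, which is handy as a sanity check on the plyr result. This is a sketch on toy data; the data frame `toy` and its values are hypothetical.

```r
# Toy per-date table (hypothetical values).
toy <- data.frame(
  DayOfWeek    = c(1, 1, 6, 6, 7),
  FlightsByDay = c(660, 662, 520, 522, 616)
)

# tapply(values, groups, FUN) applies FUN to the values within each group
# and returns an array named by group.
mean_by_weekday <- tapply(toy$FlightsByDay, toy$DayOfWeek, mean)

mean_by_weekday
```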