Practice with plyr, ggplot2

February 20, 2014

Load data and libraries

setwd("/Users/steve/Documents/Computing with Data/18_plyr_practice/")
library(ggplot2)
library(plyr)
library(hflights)

Preview the flight data

names(hflights)
 [1] "Year"              "Month"             "DayofMonth"       
 [4] "DayOfWeek"         "DepTime"           "ArrTime"          
 [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
[10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
[13] "DepDelay"          "Origin"            "Dest"             
[16] "Distance"          "TaxiIn"            "TaxiOut"          
[19] "Cancelled"         "CancellationCode"  "Diverted"         

You should also run str(hflights) to see each variable's type and a few sample values.

How does the number of flights vary by day?

Problem setup and discussion

  • How do we compute the number of flights on a given day?
  • What features of a day can affect the number of flights?
  • We are restricting the study to properties of the date

Adding a date code

  • What data features pin down the day?
  • Add a new variable uniquely determining the day

Adding a date: answer

hflights2 <- transform(hflights, date = paste(Month, "-", DayofMonth, sep=""))

There is another way: order the days and add an indicator for the day of the year. It takes a little more typing.
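The day-of-year alternative could be sketched as follows. This is a minimal illustration on a toy data frame rather than on hflights itself; the column name day_of_year is an assumption made for the example, while Year, Month and DayofMonth match the variable names listed earlier.

```r
# Toy stand-in for hflights with the same date columns.
toy <- data.frame(Year = 2011,
                  Month = c(1, 2, 12),
                  DayofMonth = c(1, 1, 31))

# Build a Date from the three columns, then extract the day of the year.
toy_dates <- as.Date(ISOdate(toy$Year, toy$Month, toy$DayofMonth))
toy$day_of_year <- as.integer(format(toy_dates, "%j"))
toy
```

Unlike the pasted "Month-Day" string, which sorts as text ("1-10" comes before "1-2"), an integer day of the year sorts in calendar order.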

Compute the number of flights for each day

Things to think about:

  • What will the result look like?
  • What represents a single flight?
  • For January 1, for example, how many flights were there?

Flights on January 1

jan1_flights <- subset(hflights2, date == "1-1")
nrow(jan1_flights)
[1] 552

Flights for every day

  • What should the output look like?
  • What tool will uniformly break up the flight data by date?
  • We can use ddply or daply

Flights for every day: solution

To produce a data.frame:

flights_by_day_df <- ddply(hflights2, .(date), function(df) nrow(df))
head(flights_by_day_df, n=3)
  date  V1
1  1-1 552
2 1-10 659
3 1-11 583
names(flights_by_day_df)[2] <- "FlightsByDay"
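As a possible variation, plyr's summarise can name the count column inside the ddply call, which avoids renaming V1 afterwards. The sketch below runs on a toy data frame with invented rows; only the column names date and FlightNum come from the flight data.

```r
library(plyr)

# Toy data: three flights on "1-1" and one on "1-2" (invented rows).
toy <- data.frame(date = c("1-1", "1-1", "1-2", "1-1"),
                  FlightNum = c(10, 11, 12, 13),
                  stringsAsFactors = FALSE)

# One row per date; the count column is named in the call itself.
toy_counts <- ddply(toy, .(date), summarise, FlightsByDay = length(FlightNum))
toy_counts
```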

Flights for every day: solution

To produce a named vector (a one-dimensional array):

flights_by_day_vector <- daply(hflights2, .(date), function(df) nrow(df))
flights_by_day_vector[1:3]
 1-1 1-10 1-11 
 552  659  583 
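Since each row is one flight, counting rows per date is the same as tabulating the date column. A quick cross-check with base R's table, shown here on toy data (invented rows), might look like this:

```r
# Toy data: one row per flight, identified only by its date code.
toy <- data.frame(date = c("1-1", "1-1", "1-2", "1-1"),
                  stringsAsFactors = FALSE)

# table() counts how often each date code occurs, i.e. flights per day.
toy_table <- table(toy$date)
toy_table[["1-1"]]
```

On the real data, the analogous call would be table(hflights2$date), which should agree with the daply result entry by entry.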

Where do we go from here?

  • What features may affect this number?
  • Where is the data we need to answer these questions?
  • What variables are characteristics of the date?
  • How do we add the flights by day results to the rest of the data?

Restrict to date-specific information

The variables are: Month, day of month, day of week, date

date_properties_df <- subset(hflights2, select=c(Month, DayofMonth, DayOfWeek, date))

How many rows here?

nrow(date_properties_df)
[1] 227496
date_properties_df <- unique(date_properties_df)
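The subset has one row per flight, so every date appears hundreds of times; unique collapses it to one row per day. A small sketch on toy data (invented rows, same column names) illustrates the effect and adds a sanity check that no date survives twice:

```r
# Toy data: two flights on January 1 and one on January 2.
toy <- data.frame(Month = c(1, 1, 1),
                  DayofMonth = c(1, 1, 2),
                  DayOfWeek = c(6, 6, 7),
                  date = c("1-1", "1-1", "1-2"),
                  stringsAsFactors = FALSE)

toy_unique <- unique(toy)          # drop duplicated rows
nrow(toy_unique)                   # one row per distinct day
anyDuplicated(toy_unique$date)     # 0 means every date appears only once
```

After the same step on the real data, nrow(date_properties_df) should equal the number of distinct days in the data set.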

Join all the data

When flights by day is a data.frame

date_properties_df2 <- merge(date_properties_df, flights_by_day_df)

When flights by day is a vector

date_properties_df2_2 <- transform(date_properties_df, FlightsByDay = as.vector(flights_by_day_vector[as.character(date)]))
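Looking up counts by name only works as intended when the index is a character vector. If date happens to be stored as a factor, R indexes by the factor's integer codes instead, which can silently return the wrong counts. The sketch below shows the difference using the three counts printed earlier; the factor's level order is chosen deliberately to expose the problem.

```r
# Counts for three days, taken from the output shown earlier.
counts <- c("1-1" = 552, "1-10" = 659, "1-11" = 583)

dates_chr <- c("1-11", "1-1")
# Levels deliberately in a different order from the names of counts.
dates_fct <- factor(dates_chr, levels = c("1-11", "1-1"))

by_name <- counts[dates_chr]                # looks up by name
by_code <- counts[dates_fct]                # uses integer codes 1 and 2
safe    <- counts[as.character(dates_fct)]  # converts back to names first
```

Here by_name and safe pick the entries for "1-11" and "1-1", while by_code picks the first two entries of counts regardless of their names.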

What is the mean number of flights per day, by month?

This is a summary statistic by month.

flights_by_month <- daply(date_properties_df2, .(Month), function(df) mean(df$FlightsByDay))
flights_by_month[1:6]
    1     2     3     4     5     6 
610.0 611.7 628.1 619.8 618.5 653.3 
flights_by_month[7:12]
    7     8     9    10    11    12 
662.8 650.8 602.2 603.1 600.7 616.7 

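There is a definite uptick in the summer: June through August average about 651 to 663 flights per day, against about 601 to 628 in the other months.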

What is the mean number of flights per day, by day of the week?

flights_by_day_week <- daply(date_properties_df2, .(DayOfWeek), function(df) mean(df$FlightsByDay))
flights_by_day_week
    1     2     3     4     5     6     7 
660.8 608.6 614.0 671.2 672.5 521.3 616.5 

Monday, Thursday and Friday (days 1, 4 and 5) have the most flights, about 661 to 673 per day. Saturday (day 6) has the fewest, about 521.
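The numeric day codes are easier to read with labels attached. The sketch below copies the means from the output above and assumes that code 1 is Monday, which is consistent with the summary above; the abbreviations in day_labels are chosen for this example.

```r
# Mean flights per day, copied from the daply output above.
means <- c("1" = 660.8, "2" = 608.6, "3" = 614.0, "4" = 671.2,
           "5" = 672.5, "6" = 521.3, "7" = 616.5)

# Assumption: DayOfWeek code 1 is Monday, 7 is Sunday.
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
names(means) <- day_labels[as.integer(names(means))]

# Busiest days first.
sort(means, decreasing = TRUE)
```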