September 30, 2013 Due: October 8, 2013
This project involves data on flight departure and arrival times from the Bureau of Transportation Statistics. We will be using a download from Ontime Table for flights in January, 2013. The flight data are in the table ONTIME1.csv that you have downloaded. I have included in the downlaod the files aircarriernames.csv (a table with the long form names of airlines and a 2-3 letter code) and airport_codes.csv (a lookup table for the DOT integer codes).
This data is available at Ontime flight data. Clicking on the link should cause the linked file to download to your compute. Move the resulting .zip file to a folder in which you want to do the homework and unzip
it. You should have a folder containing the 3 .csv files mentioned above.
First, create data.frames for these 3 .csv files. Use an option to read.csv
to ensure that strings are imported as characters and not factors.
Answer the following questions.
How many airlines had a flight in January 2013 in this database? HINT: Use unique
Carefully read the help file for the merge
function. Merge the ontime flight data with the airport codes to create a text field description of the origininating airport to the ontime flight data.frame. You'll need to specify the columns in the ontime flight data.frame and the airport code data.frame that you want to match for the merge operation. Also, set the name of the airport description column to a name that clearly describes what it is. When done, use the str
command to exhibit the characteristics of the new data.frame. Note that this didn't introduce any new records.
Use the lookup table of carrier codes to find the code for United Airlines (You can do that in Excel). Create a sub-data.frame containing all United Airlines flights.
There isn't a field exactly specifying in a "yes" or "no" whether there was a "DELAY", so I'd like you to create one. Read the descriptions of the fields on the above website, look at samples of the records, and decide what property of one of the existing fields characterizes when there is a delay. Then create a new data.frame of the United Airlines flights with a new column that is a logical vector saying whether there is or isn't an official delay.
What percentage of United Airlines flight had a delay?
Sort the data.frame generated in 8 by the amount of delay in decreasing order. HINT: use the order
function.
Which originating airports had the 3 longest delays for United Airlines in January 2013?
In doing this project, many natural questions should have occurred to you. Here are a few.
These are questions we can naturally answer with lapply
and similar functions.