Computing with Data - Homework 2

October 16, 2014 Due: October 29, 2014

Defining functions problems

  1. Write a function of one variable that, for an input numeric vector, computes the mean of the values above the median. Check that the input is a numeric vector and, if it isn’t, return an appropriate message about the wrong function argument.
  2. Write a function that, given a data.frame, reports the number of columns of each possible class. The possible classes are numeric, integer, character, logical, and factor.
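
Both problems lean on two base R idioms: testing the type of an argument before using it, and looking up the class of every column of a data.frame. The sketch below shows only those idioms on made-up inputs. The function name describe_input and the data frame toy_df are hypothetical, and this is not meant as a solution to either problem.

```r
# Hypothetical helper: illustrates the argument-checking pattern only.
describe_input <- function(x) {
  # is.numeric() is TRUE for both double and integer vectors.
  if (!is.numeric(x) || !is.vector(x)) {
    return("Argument 'x' must be a numeric vector.")
  }
  paste("Numeric vector of length", length(x))
}

ok_msg  <- describe_input(c(1.5, 2, 10))
bad_msg <- describe_input("abc")

# Made-up data.frame with one column of each of three classes.
toy_df <- data.frame(a = c(1.5, 2.5), b = c(1L, 2L), c = c("u", "v"),
                     stringsAsFactors = FALSE)

# sapply() applies class() to each column and returns a named character vector.
col_classes <- sapply(toy_df, class)
print(col_classes)
```

Returning a message, rather than calling stop(), matches the wording of problem 1; stop() would be the usual choice when the caller should not continue.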

Functions on matrices and lists

Execute the following to load objects into your session.

source("http://www3.nd.edu/~steve/computing_with_data_2014/9_Functions_matrices_lists/work_along_data_S9.R")

  1. Compute the sums of the non-negative values of each column of mat2.
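
Column-wise computations like this one are usually written with apply. As a minimal sketch of the pattern, the code below counts the negative entries in each column of a small made-up matrix. The matrix toy_mat is hypothetical and merely stands in for the objects loaded by the source call above.

```r
# Hypothetical 3 x 2 matrix; matrix() fills column by column.
toy_mat <- matrix(c( 1, -2, 3,
                    -4,  5, 6), nrow = 3)

# apply() with MARGIN = 2 calls the function once per column.
# Inside the function, col[col < 0] keeps only the negative entries.
neg_counts <- apply(toy_mat, 2, function(col) length(col[col < 0]))
print(neg_counts)
```

The same shape (an anonymous function that first subsets the column with a logical condition, then summarizes what is left) carries over to other per-column summaries.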

Working with data frames

This project involves data on flight departure and arrival times from the Bureau of Transportation Statistics. We will be using a download from Ontime Table for flights in January 2013. The flight data are in the table ONTIME1.csv that you have downloaded. I have included in the download the files air_carrier_names.csv (a table with the long-form names of airlines and a 2-3 letter code) and airport_codes.csv (a lookup table for the DOT integer codes).

This data is available at Ontime flight data. Clicking on the link should cause the linked file to download to your computer. Move the resulting .zip file to a folder in which you want to do the homework and unzip it. You should have a folder containing the 3 .csv files mentioned above.

First, create data.frames for these 3 .csv files. Use an option to read.csv to ensure that strings are imported as characters and not factors.
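
The relevant read.csv option is stringsAsFactors. Here is a minimal, self-contained sketch that writes a tiny made-up CSV to a temporary file and reads it back. The column names CODE and NAME and the rows are invented for illustration and are not taken from the homework files.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("CODE,NAME", "AA,Alpha Air", "BB,Beta Air"), tmp)

# stringsAsFactors = FALSE keeps text columns as character vectors.
toy_carriers <- read.csv(tmp, stringsAsFactors = FALSE)
str(toy_carriers)

# Remove the temporary file.
unlink(tmp)
```

In R 4.0.0 and later this is already the default, but setting the option explicitly keeps the behavior the same across versions.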

Answer the following questions.

  1. Are there any flights with missing CARRIER information?
  2. Are there any flights with carrier codes that aren’t found in the carrier look-up table?
  3. How many airlines had a flight in January 2013 in this database? HINT: Use unique

  4. Carefully read the help file for the merge function. Merge the ontime flight data with the airport codes to add a text field describing the originating airport to the ontime flight data.frame. You’ll need to specify the columns in the ontime flight data.frame and the airport code data.frame that you want to match for the merge operation. Also, give the airport description column a name that clearly describes what it is. When done, use the str command to exhibit the characteristics of the new data.frame. Note that the merge didn’t introduce any new records.

  5. Use the lookup table of carrier codes to find the code for United Airlines (You can do that in Excel). Create a sub-data.frame containing all United Airlines flights.

  6. There isn’t a field that specifies with a “yes” or “no” whether there was a “DELAY”, so I’d like you to create one. Read the descriptions of the fields on the above website, look at samples of the records, and decide what property of one of the existing fields characterizes when there is a delay. Then create a new data.frame of the United Airlines flights with a new column that is a logical vector saying whether there is or isn’t an official delay.

  7. What percentage of United Airlines flights had a delay?

  8. Sort the data.frame generated in 6 by the amount of delay in decreasing order. HINT: use the order function.

  9. Which originating airports had the 3 longest delays for United Airlines in January 2013?
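
The mechanics behind questions 2, 4, 6, and 8 (membership tests, merge with differently named key columns, a logical column, and sorting with order) can be sketched on toy tables. Everything below is made up for illustration: the data frames, the column names origin_id, late_min, code, and description, and the 10-minute threshold are all hypothetical and do not come from the real data set.

```r
# Hypothetical stand-ins for the flight table and the airport lookup table.
flights <- data.frame(origin_id = c(10L, 20L, 10L, 30L),
                      late_min  = c(5, 42, 0, 17),
                      stringsAsFactors = FALSE)
airports <- data.frame(code = c(10L, 20L, 30L),
                       description = c("Airport A", "Airport B", "Airport C"),
                       stringsAsFactors = FALSE)

# %in% tests whether every key in one table appears in the other.
all_known <- all(flights$origin_id %in% airports$code)

# by.x / by.y name the key column in each table when the names differ.
merged <- merge(flights, airports, by.x = "origin_id", by.y = "code")

# Rename the looked-up column to something self-describing.
names(merged)[names(merged) == "description"] <- "origin_airport"

# A logical column built from a comparison; the threshold here is arbitrary.
merged$is_late <- merged$late_min >= 10

# order() returns a row permutation; decreasing = TRUE puts the largest first.
sorted <- merged[order(merged$late_min, decreasing = TRUE), ]
str(sorted)

# The mean of a logical vector is the fraction of TRUE values.
pct_late <- mean(sorted$is_late) * 100
```

Because every key in the toy flight table has exactly one match in the lookup table, the merged result has the same number of rows as the flight table, which is the property question 4 asks you to notice.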

In doing this project, many natural questions should have occurred to you. Here are a few.

  • Which airlines had the best on-time performance record?
  • What are the prevalent reasons for delays?
  • Which airports had the greatest number of weather delays? How about the greatest number of different days with weather delays?

These are questions we can naturally answer with lapply and similar functions.
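
As a rough sketch of that approach, the code below splits a made-up table by group and applies a summary to each piece with lapply. The data frame toy and its columns carrier and late_min are hypothetical.

```r
# Hypothetical table: one row per flight, with a group label and a value.
toy <- data.frame(carrier  = c("AA", "BB", "AA", "BB", "CC"),
                  late_min = c(10, 0, 30, 20, 5),
                  stringsAsFactors = FALSE)

# split() returns a list with one vector of values per carrier.
by_carrier <- split(toy$late_min, toy$carrier)

# lapply() applies mean() to each list element; unlist() flattens the result
# into a named numeric vector.
mean_late <- unlist(lapply(by_carrier, mean))
print(mean_late)

# Name of the group with the smallest average.
best <- names(which.min(mean_late))
```

sapply(by_carrier, mean) gives the same named vector in one step, and tapply(toy$late_min, toy$carrier, mean) combines the split and the summary.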