Computing with Data Homework

February 26, 2014

Due March 17

This problem set will give you experience with ggplot2, plyr, and analyses involving many covariates.

You will use two sets of data for different problems.

Data 1: This is the Wisconsin breast cancer data we used for the study with kNN. You can load this into a data.frame using

wisc_bc_df <- read.csv("http://www3.nd.edu/~steve/computing_with_data/Data/wisc_bc_data.csv")

Data 2: This is the hflights data, which you can load with

library(hflights)

Please note that the hflights package may not be available for R versions prior to 3.0.

Problems

  1. Using ggplot2, draw a boxplot of the variable representing the worst reading of the points feature (points_worst) against the diagnosis variable. Set the axis names and labels so that they are easy to interpret.
    HINT: Use the online help at http://docs.ggplot2.org/0.9.3.1/index.html to identify a geom that produces a boxplot, and use scale functions to set the axis names and labels.
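As a rough illustration of the pattern the hint describes, the sketch below draws a boxplot of a numeric variable against a categorical one and names both axes through scale functions. It uses the built-in mtcars data purely as a stand-in, so the variables `mpg` and `am` and the axis titles are assumptions for illustration and not part of the assignment data.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the homework data here.
cars <- mtcars

# Recode the 0/1 transmission flag as a labelled factor so the
# x axis shows readable category labels.
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# One box per category; the scale functions set the axis names.
p <- ggplot(cars, aes(x = am, y = mpg)) +
  geom_boxplot() +
  scale_x_discrete(name = "Transmission") +
  scale_y_continuous(name = "Fuel economy (miles per gallon)")

print(p)
```

The same structure should carry over to the breast cancer data once the grouping variable is a factor with readable labels.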

  2. Restrict the breast cancer data to the numeric variables. Using the points_worst variable as the response variable, execute a linear fit with each of the other numeric variables as a covariate. Select a measure of significance and report the 10 most significant variables, along with the measure of significance, ordered by significance.
    HINT: There are multiple ways to do this problem.
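One possible approach, sketched below on the built-in mtcars data, fits a separate simple linear model for each covariate and ranks the covariates by the p-value of the slope. This reads the problem as one fit per covariate and takes the slope p-value as the measure of significance; both are assumptions, and other readings and measures are equally valid. The response `mpg` is a stand-in for the homework response.

```r
# Keep only the numeric columns (all of mtcars is numeric already,
# but the filter shows the general step).
num_df <- mtcars[sapply(mtcars, is.numeric)]

response   <- "mpg"
covariates <- setdiff(names(num_df), response)

# Fit response ~ covariate for each covariate and extract the
# p-value of the slope (second row of the coefficient table).
p_values <- sapply(covariates, function(v) {
  fit <- lm(reformulate(v, response = response), data = num_df)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})

# Smallest p-value first, i.e. most significant first.
ranked <- sort(p_values)
print(head(ranked, 3))
```

Because `sapply` over a character vector keeps the names, `ranked` is a named vector that pairs each variable with its p-value.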

  3. Add to the hflights data.frame a variable for air speed; i.e., the distance divided by the time in the air. Please convert this to the standard measure of miles per hour. Using ggplot2, plot the density of the distribution of air speeds for all flights to Detroit (DTW).
    HINT: Look at the online examples for the density geom.
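The unit conversion can be sketched on a small made-up data frame. The sketch assumes, as the hflights documentation states, that Distance is in miles and AirTime is in minutes, so dividing the air time by 60 gives hours. The rows below are invented for illustration and are not taken from hflights; the column name `AirSpeed` is likewise an assumption.

```r
library(ggplot2)

# Invented rows that mimic the relevant hflights columns.
flights <- data.frame(
  Dest     = c("DTW", "DTW", "DTW", "ORD"),
  Distance = c(1092, 1092, 1076, 925),   # miles
  AirTime  = c(140, 156, 150, 130)       # minutes
)

# miles / hours = miles per hour
flights$AirSpeed <- flights$Distance / (flights$AirTime / 60)

# Restrict to one destination and plot the density of its speeds.
dtw <- subset(flights, Dest == "DTW")
p <- ggplot(dtw, aes(x = AirSpeed)) +
  geom_density() +
  xlab("Air speed (miles per hour)") +
  ylab("Density")

print(p)
```

In the real data some flights have a missing AirTime, so the resulting air speed is NA for those rows; ggplot2 drops them with a warning.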

  4. (i) For each destination, compute the variance of the air speeds. (ii) Among the destinations with at least 100 flights, which one has the highest variance?
    (iii) For the destination identified in (ii), does the mean air speed vary significantly across the months?
    HINT: use plyr.
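A minimal sketch of the split-apply-combine step with plyr follows, using an invented data frame. The destination codes, the speeds, and the threshold of 2 flights are all made up so that the example stays small; the assignment itself uses a threshold of 100.

```r
library(plyr)

# Invented data: three destinations with different flight counts.
toy <- data.frame(
  Dest     = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC"),
  AirSpeed = c(400, 420, 440, 300, 500, 450)
)

# One row per destination with its flight count and speed variance.
by_dest <- ddply(toy, "Dest", summarise,
                 n_flights = length(AirSpeed),
                 speed_var = var(AirSpeed))

# Keep destinations with enough flights, then pick the largest variance.
eligible <- subset(by_dest, n_flights >= 2)
top <- eligible[which.max(eligible$speed_var), ]
print(top)
```

With real data, missing air speeds would need handling, for example by passing `na.rm = TRUE` to `var`.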

  5. The individual plane used for a flight is identified by its tail number. For each plane that has more than 15 flights, compute the total number of miles flown in 2011. Which 5 airplanes flew the most miles, and how many miles did each fly?