# Computing with Data Homework

## February 26, 2014

Due March 17

This problem set will give you experience with `ggplot2`, `plyr`, and analyses involving many covariates.

You will use two sets of data for different problems.

Data 1: This is the Wisconsin breast cancer data we used for the study with kNN. You can load this into a data.frame using

```r
wisc_bc_df <- read.csv("http://www3.nd.edu/~steve/computing_with_data/Data/wisc_bc_data.csv")
```

Data 2: This is the `hflights` data, which becomes available once you load the package with

```r
library(hflights)
```

Please note that this may not be available in R versions prior to 3.0.
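
If the package is not yet installed, `library(hflights)` will fail. The following is a minimal sketch of one way to handle that, assuming the package is still distributed on CRAN and that the mirror URL below (an assumption, not part of the assignment) is reachable from your machine.

```r
# Install hflights only when it is missing, then load it.
# The CRAN mirror URL is an assumption; substitute your preferred mirror.
if (!requireNamespace("hflights", quietly = TRUE)) {
  install.packages("hflights", repos = "https://cloud.r-project.org")
}
library(hflights)

# Quick look at the structure of the data.frame before starting the problems.
str(hflights)
```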

## Problems

1. Using `ggplot2`, draw a boxplot of the variable representing the worst reading of the points feature (`points_worst`) against the diagnosis variable. Set the axis names and tick labels so that they are easy to interpret.
HINT: Use the online help at http://docs.ggplot2.org/0.9.3.1/index.html to identify a geom that produces a boxplot, and use the scale functions to set the axis names and labels.

2. Restrict the breast cancer data to the numeric variables. Using the points_worst variable as the response variable, execute a linear fit with each of the other numeric variables as a covariate. Select a measure of significance and report the 10 most significant variables, along with the measure of significance, ordered by significance.
HINT: There are multiple ways to do this problem.

3. Add to the `hflights` data.frame a variable for air speed; i.e., the distance divided by the time in the air. Please convert this to the standard measure of miles per hour. Using `ggplot2`, plot the density of the distribution of air speeds for all flights to Detroit (DTW).
HINT: Look at the examples for the density geom online.

4. For each destination, (i) compute the variance of the air speeds. (ii) Among the destinations with at least 100 flights, which one has the highest variance? (iii) For the destination identified in (ii), does the mean air speed vary significantly across the months?
HINT: Use `plyr`.

5. The unique plane used for a flight is indicated by its tail number. For each plane that has more than 15 flights, compute the total number of miles flown in 2011. Which five airplanes flew the most miles, and how many miles did each fly?
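
The hints above point to a handful of recurring patterns: a labelled boxplot, a density plot, one linear fit per covariate, and a grouped summary with `plyr`. The sketch below illustrates each pattern on R's built-in `mtcars` data, which stands in for the homework data purely for illustration. The variable names, the axis labels, and the choice of the slope p-value as the measure of significance are all assumptions of this sketch, and other approaches are equally valid.

```r
library(ggplot2)
library(plyr)

# Toy data: mtcars is a stand-in, not one of the homework datasets.
cars <- mtcars
cars$cyl <- factor(cars$cyl)

# (a) Boxplot of a numeric variable against a factor, with readable
#     axis names and tick labels set through scale functions.
p_box <- ggplot(cars, aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  scale_x_discrete(name = "Number of cylinders",
                   labels = c("4" = "Four", "6" = "Six", "8" = "Eight")) +
  scale_y_continuous(name = "Fuel economy (miles per gallon)")

# (b) Density plot of a single numeric variable.
p_dens <- ggplot(cars, aes(x = mpg)) +
  geom_density() +
  xlab("Fuel economy (miles per gallon)")

# (c) One simple linear fit per covariate, ranked by the p-value of the
#     slope (one possible measure of significance among several).
response <- "mpg"
covariates <- setdiff(names(mtcars), response)
p_values <- sapply(covariates, function(v) {
  fit <- lm(reformulate(v, response = response), data = mtcars)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})
ranked <- sort(p_values)  # smallest p-value first

# (d) Split-apply-combine with plyr: group size and variance per group.
by_cyl <- ddply(cars, .(cyl), summarise,
                n = length(mpg),
                var_mpg = var(mpg))
```

Printing `p_box` or `p_dens` draws the plot, `head(ranked, 10)` lists the ten smallest p-values in order, and `by_cyl` can be subset and sorted like any other data.frame (for example, to keep only groups above a size threshold).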