Here we illustrate markdown, knitr, and ggplot2 to motivate whatâ€™s to come. Weâ€™ll use a dataset on 10000 measurements of height and weight for men and women available through the book *Machine Learning for Hackers*, Drew Conway & John Myles-White, Oâ€™Reilly Media.

We begin by loading packages.

```
setwd("~/Documents/Computing with Data/2_Motivation/")
library(ggplot2)
```

First load the data.

```
ht_weight_df <- read.csv(file="../Data/01_heights_weights_genders.txt")
# str is short for structure(). It reports what's in the data.frame
str(ht_weight_df)
```

```
## 'data.frame': 10000 obs. of 3 variables:
## $ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Height: num 73.8 68.8 74.1 71.7 69.9 ...
## $ Weight: num 242 162 213 220 206 ...
```

`head(ht_weight_df)`

```
## Gender Height Weight
## 1 Male 73.85 241.9
## 2 Male 68.78 162.3
## 3 Male 74.11 212.7
## 4 Male 71.73 220.0
## 5 Male 69.88 206.3
## 6 Male 67.25 152.2
```

We certainly expect dependence between height and weight. Letâ€™s plot the points on a plane and see what it looks like.

`plot(x= ht_weight_df$Height, y=ht_weight_df$Weight)`

It really does look like there is a strong linear relationship. Execute an `lm`

fit to see.

```
lm_ht_weight <- lm(Weight ~ Height, data = ht_weight_df)
summary(lm_ht_weight)
```

```
##
## Call:
## lm(formula = Weight ~ Height, data = ht_weight_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.93 -8.24 -0.12 8.26 46.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.7372 2.1115 -166 <2e-16 ***
## Height 7.7173 0.0318 243 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 9998 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.855
## F-statistic: 5.9e+04 on 1 and 9998 DF, p-value: <2e-16
```

There is an extremely strong linear relationship.

We expect that they do. Inspect the quantiles of height after restricting to each geneder.

```
# Subset the full data.frame by genders
male_df <- subset(ht_weight_df, Gender == "Male")
female_df <- subset(ht_weight_df, Gender == "Female")
# Get the summary values of height
summary(male_df$Height)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.4 67.2 69.0 69.0 71.0 79.0
```

`summary(female_df$Height)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.3 61.9 63.7 63.7 65.6 73.4
```

We see that the means and other order statistics are shifted. Are the entire distributions shifted, or is there some skewing by geneder?

`plot(density(male_df$Height))`

`plot(density(female_df$Height))`

For men there is a little bump around 6 feet. (Maybe the measurers generously rounded up the guys who are 5â€™11.5â€œ.) However the distributions look pretty close. It would be best if we could overlay the plots.

The package ggplot2 has had a significant impact on visualization in statistics. It will change the way you work.

```
dens_by_gender <- ggplot(data = ht_weight_df, aes(x=Height, color=Gender)) + geom_density()
dens_by_gender
```

This makes it more clear. It looks like these are two normal distributions with different means. We can check a Q-Q plot.

```
qq_by_gender <- ggplot(data = ht_weight_df, aes(sample=Height)) +
geom_point(stat="qq") + facet_wrap(~ Gender)
qq_by_gender
```