Here we illustrate markdown, knitr, and ggplot2 to motivate whatâ€™s to come. Weâ€™ll use a dataset on 10000 measurements of height and weight for men and women available through the book Machine Learning for Hackers, Drew Conway & John Myles-White, Oâ€™Reilly Media.

``````setwd("~/Documents/Computing with Data/2_Motivation/")
library(ggplot2)``````

First load the data.

``````ht_weight_df <- read.csv(file="../Data/01_heights_weights_genders.txt")
# str is short for structure(). It reports what's in the data.frame
str(ht_weight_df)``````
``````## 'data.frame':    10000 obs. of  3 variables:
##  \$ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ Height: num  73.8 68.8 74.1 71.7 69.9 ...
##  \$ Weight: num  242 162 213 220 206 ...``````
``head(ht_weight_df)``
``````##   Gender Height Weight
## 1   Male  73.85  241.9
## 2   Male  68.78  162.3
## 3   Male  74.11  212.7
## 4   Male  71.73  220.0
## 5   Male  69.88  206.3
## 6   Male  67.25  152.2``````

## Relationship between height and weight

We certainly expect dependence between height and weight. Letâ€™s plot the points on a plane and see what it looks like.

``plot(x= ht_weight_df\$Height, y=ht_weight_df\$Weight)``

It really does look like there is a strong linear relationship. Execute an `lm` fit to see.

``````lm_ht_weight <- lm(Weight ~ Height, data = ht_weight_df)
summary(lm_ht_weight)``````
``````##
## Call:
## lm(formula = Weight ~ Height, data = ht_weight_df)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -51.93  -8.24  -0.12   8.26  46.84
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.7372     2.1115    -166   <2e-16 ***
## Height         7.7173     0.0318     243   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 9998 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.855
## F-statistic: 5.9e+04 on 1 and 9998 DF,  p-value: <2e-16``````

There is an extremely strong linear relationship.

## How do height and weight depend on gender?

We expect that they do. Inspect the quantiles of height after restricting to each geneder.

``````# Subset the full data.frame by genders
male_df <- subset(ht_weight_df, Gender == "Male")
female_df <- subset(ht_weight_df, Gender == "Female")
# Get the summary values of height
summary(male_df\$Height)``````
``````##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    58.4    67.2    69.0    69.0    71.0    79.0``````
``summary(female_df\$Height)``
``````##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    54.3    61.9    63.7    63.7    65.6    73.4``````

### Comparing the densities by gender

We see that the means and other order statistics are shifted. Are the entire distributions shifted, or is there some skewing by geneder?

``plot(density(male_df\$Height))``

``plot(density(female_df\$Height))``

For men there is a little bump around 6 feet. (Maybe the measurers generously rounded up the guys who are 5â€™11.5â€œ.) However the distributions look pretty close. It would be best if we could overlay the plots.

#### Using ggplot2 to overlay plots

The package ggplot2 has had a significant impact on visualization in statistics. It will change the way you work.

``````dens_by_gender <- ggplot(data = ht_weight_df, aes(x=Height, color=Gender)) + geom_density()
dens_by_gender``````

This makes it more clear. It looks like these are two normal distributions with different means. We can check a Q-Q plot.

``````qq_by_gender <- ggplot(data = ht_weight_df, aes(sample=Height)) +
geom_point(stat="qq") + facet_wrap(~ Gender)
qq_by_gender``````