Here we illustrate markdown, knitr, and ggplot2 to motivate what’s to come. We’ll use a dataset on 10000 measurements of height and weight for men and women available through the book Machine Learning for Hackers, Drew Conway & John Myles-White, O’Reilly Media.
We begin by loading packages.
setwd("~/Documents/Computing with Data/2_Motivation/")
library(ggplot2)
First load the data.
ht_weight_df <- read.csv(file="../Data/01_heights_weights_genders.txt")
# str is short for structure(). It reports what's in the data.frame
str(ht_weight_df)
## 'data.frame': 10000 obs. of 3 variables:
## $ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Height: num 73.8 68.8 74.1 71.7 69.9 ...
## $ Weight: num 242 162 213 220 206 ...
head(ht_weight_df)
## Gender Height Weight
## 1 Male 73.85 241.9
## 2 Male 68.78 162.3
## 3 Male 74.11 212.7
## 4 Male 71.73 220.0
## 5 Male 69.88 206.3
## 6 Male 67.25 152.2
We certainly expect dependence between height and weight. Let’s plot the points on a plane and see what it looks like.
plot(x= ht_weight_df$Height, y=ht_weight_df$Weight)
It really does look like there is a strong linear relationship. Execute an lm
fit to see.
lm_ht_weight <- lm(Weight ~ Height, data = ht_weight_df)
summary(lm_ht_weight)
##
## Call:
## lm(formula = Weight ~ Height, data = ht_weight_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.93 -8.24 -0.12 8.26 46.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.7372 2.1115 -166 <2e-16 ***
## Height 7.7173 0.0318 243 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 9998 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.855
## F-statistic: 5.9e+04 on 1 and 9998 DF, p-value: <2e-16
There is an extremely strong linear relationship.
We expect that they do. Inspect the quantiles of height after restricting to each geneder.
# Subset the full data.frame by genders
male_df <- subset(ht_weight_df, Gender == "Male")
female_df <- subset(ht_weight_df, Gender == "Female")
# Get the summary values of height
summary(male_df$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.4 67.2 69.0 69.0 71.0 79.0
summary(female_df$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.3 61.9 63.7 63.7 65.6 73.4
We see that the means and other order statistics are shifted. Are the entire distributions shifted, or is there some skewing by geneder?
plot(density(male_df$Height))
plot(density(female_df$Height))
For men there is a little bump around 6 feet. (Maybe the measurers generously rounded up the guys who are 5’11.5“.) However the distributions look pretty close. It would be best if we could overlay the plots.
The package ggplot2 has had a significant impact on visualization in statistics. It will change the way you work.
dens_by_gender <- ggplot(data = ht_weight_df, aes(x=Height, color=Gender)) + geom_density()
dens_by_gender
This makes it more clear. It looks like these are two normal distributions with different means. We can check a Q-Q plot.
qq_by_gender <- ggplot(data = ht_weight_df, aes(sample=Height)) +
geom_point(stat="qq") + facet_wrap(~ Gender)
qq_by_gender
Yes, these are normal.
We saw earlier that there is a good linear fit of weight versus height, but the distributions of these vary by gender. It’s natural to ask if there is a significant difference between the linear fits. First plot the points colored by gender.
ht_wt_pt_gender <- ggplot(data = ht_weight_df, aes(x = Height, y=Weight, color=Gender)) +
geom_point(alpha=0.2)
ht_wt_pt_gender
ht_wt_pt_gender2 <- ht_wt_pt_gender + geom_smooth(method="lm", data = subset(ht_weight_df, Gender == "Male"), color="black") + geom_smooth(method="lm", data = subset(ht_weight_df, Gender == "Female"), color="black")
ht_wt_pt_gender2
It looks like the lines should have the same slope. Let’s turn back to statistics and compute the fits with a factor for gender.
lm_ht_wt_by_gender <- lm( Weight ~ Height*Gender, data = ht_weight_df)
summary(lm_ht_wt_by_gender)
##
## Call:
## lm(formula = Weight ~ Height * Gender, data = ht_weight_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.19 -6.80 -0.12 6.81 35.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -246.0133 3.3497 -73.44 <2e-16 ***
## Height 5.9940 0.0525 114.10 <2e-16 ***
## GenderMale 21.5144 4.7853 4.50 7e-06 ***
## Height:GenderMale -0.0323 0.0722 -0.45 0.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10 on 9996 degrees of freedom
## Multiple R-squared: 0.903, Adjusted R-squared: 0.903
## F-statistic: 3.09e+04 on 3 and 9996 DF, p-value: <2e-16
So, the intercepts are statistically different but the slopes are not.