Analysis of career performance in top home run hitters

As an example of using plyr for an analysis that requires a lot of data manipulation, we consider a problem in understanding career performance in baseball players. Trying to project future performance of players is relevant to many in the profession, specifically, a team manager negotiating a multi-year contract and a player's agent. If we consider a player requesting a long-term contract who is 8 years in the league and has been a top home run hitter for a couple of years. How do we project his performance 5-10 years hence?

Intialize the plyr package.

library(plyr)

The Lahman package provides datasets of baseball statistics.

library(Lahman)
str(Batting)
## 'data.frame':    93955 obs. of  24 variables:
##  $ playerID : chr  "aardsda01" "aardsda01" "aardsda01" "aardsda01" ...
##  $ yearID   : int  2004 2006 2007 2008 2009 2010 1954 1955 1956 1957 ...
##  $ stint    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID   : Factor w/ 149 levels "ALT","ANA","ARI",..: 116 35 33 16 115 115 79 79 79 79 ...
##  $ lgID     : Factor w/ 7 levels "AA","AL","FL",..: 5 5 2 2 2 2 5 5 5 5 ...
##  $ G        : int  11 45 25 47 73 53 122 153 153 151 ...
##  $ G_batting: int  11 43 2 5 3 4 122 153 153 151 ...
##  $ AB       : int  0 2 0 1 0 0 468 602 609 615 ...
##  $ R        : int  0 0 0 0 0 0 58 105 106 118 ...
##  $ H        : int  0 0 0 0 0 0 131 189 200 198 ...
##  $ X2B      : int  0 0 0 0 0 0 27 37 34 27 ...
##  $ X3B      : int  0 0 0 0 0 0 6 9 14 6 ...
##  $ HR       : int  0 0 0 0 0 0 13 27 26 44 ...
##  $ RBI      : int  0 0 0 0 0 0 69 106 92 132 ...
##  $ SB       : int  0 0 0 0 0 0 2 3 2 1 ...
##  $ CS       : int  0 0 0 0 0 0 2 1 4 1 ...
##  $ BB       : int  0 0 0 0 0 0 28 49 37 57 ...
##  $ SO       : int  0 0 0 1 0 0 39 61 54 58 ...
##  $ IBB      : int  0 0 0 0 0 0 NA 5 6 15 ...
##  $ HBP      : int  0 0 0 0 0 0 3 3 2 0 ...
##  $ SH       : int  0 1 0 0 0 0 6 7 5 0 ...
##  $ SF       : int  0 0 0 0 0 0 4 4 7 3 ...
##  $ GIDP     : int  0 0 0 0 0 0 13 20 21 13 ...
##  $ G_old    : int  11 45 2 5 NA NA 122 153 153 151 ...
str(Master)
## 'data.frame':    17674 obs. of  35 variables:
##  $ lahmanID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ playerID    : chr  "aaronha01" "aaronto01" "aasedo01" "abadan01" ...
##  $ managerID   : chr  NA NA NA NA ...
##  $ hofID       : chr  "aaronha01h" NA NA NA ...
##  $ birthYear   : int  1934 1939 1954 1972 1854 1877 1869 1866 1862 1874 ...
##  $ birthMonth  : int  2 8 9 8 11 4 11 10 3 10 ...
##  $ birthDay    : int  5 5 8 25 4 15 29 14 16 22 ...
##  $ birthCountry: chr  "USA" "USA" "USA" "USA" ...
##  $ birthState  : chr  "AL" "AL" "CA" "FL" ...
##  $ birthCity   : chr  "Mobile" "Mobile" "Orange" "West Palm Beach" ...
##  $ deathYear   : int  NA 1984 NA NA 1905 1957 1962 1926 1930 1935 ...
##  $ deathMonth  : int  NA 8 NA NA 5 1 6 4 2 6 ...
##  $ deathDay    : int  NA 16 NA NA 17 6 11 27 13 11 ...
##  $ deathCountry: chr  NA "USA" NA NA ...
##  $ deathState  : chr  NA "GA" NA NA ...
##  $ deathCity   : chr  NA "Atlanta" NA NA ...
##  $ nameFirst   : chr  "Hank" "Tommie" "Don" "Andy" ...
##  $ nameLast    : chr  "Aaron" "Aaron" "Aase" "Abad" ...
##  $ nameNote    : chr  NA NA NA NA ...
##  $ nameGiven   : chr  "Henry Louis" "Tommie Lee" "Donald William" NA ...
##  $ nameNick    : chr  "Hammer,Hammerin' Hank,Bad Henry" NA NA NA ...
##  $ weight      : int  180 190 190 184 192 170 175 169 190 180 ...
##  $ height      : num  72 75 75 73 72 71 71 68 71 70 ...
##  $ bats        : chr  "R" "R" "R" "L" ...
##  $ throws      : chr  "R" "R" "R" "L" ...
##  $ debut       : Date, format: "1954-04-13" "1962-04-10" ...
##  $ finalGame   : Date, format: "1976-10-03" "1971-09-26" ...
##  $ college     : chr  NA NA "Cal St. Fullerton" "Middle Georgia JC" ...
##  $ lahman40ID  : chr  "aaronha01" "aaronto01" "aasedo01" "abadan01" ...
##  $ lahman45ID  : chr  "aaronha01" "aaronto01" "aasedo01" "abadan01" ...
##  $ retroID     : chr  "aaroh101" "aarot101" "aased001" "abada001" ...
##  $ holtzID     : chr  "aaronha01" "aaronto01" "aasedo01" "abadan01" ...
##  $ bbrefID     : chr  "aaronha01" "aaronto01" "aasedo01" "abadan01" ...
##  $ deathDate   : Date, format: NA "1984-08-16" ...
##  $ birthDate   : Date, format: "1934-02-05" "1939-08-05" ...

These are yearly total stats for players spanning the years:

min(Batting$year)
## [1] 1871
max(Batting$year)
## [1] 2010

Being a Hall of Fame player is based on a whole career of performance. The data are given per year and per player. To get career-long home run statistics we need to group the data by player and analyze total statistics for each player's sub-data.frame.

hr_stats_df <- ddply(Batting, .(playerID), function(df) c(mean(df$HR, na.rm = T), 
    max(df$HR, na.rm = T), sum(df$HR, na.rm = T), nrow(df)))
head(hr_stats_df)
##    playerID     V1 V2  V3 V4
## 1 aardsda01  0.000  0   0  6
## 2 aaronha01 32.826 47 755 23
## 3 aaronto01  1.857  8  13  7
## 4  aasedo01  0.000  0   0 13
## 5  abadan01  0.000  0   0  3
## 6  abadfe01  0.000  0   0  1
nrow(hr_stats_df)
## [1] 17453
names(hr_stats_df)[c(2, 3, 4, 5)] <- c("HR.mean", "HR.max", "HR.total", "career.length")

Our player of interest has been around for 8 years and is requesting a multi-year contract, so let's restrict our attention to players that were in the league for at least 10 years.

hr_stats_long_df <- subset(hr_stats_df, career.length >= 10)

Let's merge this data into the Batting data.frame so we have the other information to work with, too.

Batting_hr <- merge(Batting, hr_stats_long_df)

This both merged the data and restricted the players to the ones we identified with at least 10 years in the game.

In order to analyze how players perform over their careers using this yearly data we need to know which career-year a given year's record corresponds to. Let's look at one player and define a new variable that indicates this. Babe Ruth has playerID "ruthba01".

ruth_df <- subset(Batting_hr, playerID == "ruthba01")
min(ruth_df$yearID)
## [1] 1914
ruth_df2 <- transform(ruth_df, career.year = yearID - min(yearID) + 1)
head(ruth_df2)
##       playerID yearID stint teamID lgID   G G_batting  AB   R   H X2B X3B
## 38476 ruthba01   1931     1    NYA   AL 145       145 534 149 199  31   3
## 38477 ruthba01   1918     1    BOS   AL  95        95 317  50  95  26  11
## 38478 ruthba01   1932     1    NYA   AL 133       133 457 120 156  13   5
## 38479 ruthba01   1915     1    BOS   AL  42        42  92  16  29  10   1
## 38480 ruthba01   1934     1    NYA   AL 125       125 365  78 105  17   4
## 38481 ruthba01   1933     1    NYA   AL 137       137 459  97 138  21   3
##       HR RBI SB CS  BB SO IBB HBP SH SF GIDP G_old HR.mean HR.max HR.total
## 38476 46 163  5  4 128 51  NA   1  0 NA   NA   145   32.45     60      714
## 38477 11  66  6 NA  58 58  NA   2  3 NA   NA    95   32.45     60      714
## 38478 41 137  2  2 130 62  NA   2  0 NA   NA   133   32.45     60      714
## 38479  4  21  0 NA   9 23  NA   0  2 NA   NA    42   32.45     60      714
## 38480 22  84  1  3 104 63  NA   2  0 NA   NA   125   32.45     60      714
## 38481 34 103  4  5 114 90  NA   2  0 NA   NA   137   32.45     60      714
##       career.length career.year
## 38476            22          18
## 38477            22           5
## 38478            22          19
## 38479            22           2
## 38480            22          21
## 38481            22          20

To do this uniformly for all players we use ddply to group the data by player, we add the career.year variable calculated for this player, then join all the data.frames together into one.

Batting_hr_cy <- ddply(Batting_hr, .(playerID), function(df) transform(df, career.year = yearID - 
    min(yearID) + 1))
head(Batting_hr_cy)
##    playerID yearID stint teamID lgID   G G_batting  AB   R   H X2B X3B HR
## 1 aaronha01   1967     1    ATL   NL 155       155 600 113 184  37   3 39
## 2 aaronha01   1971     1    ATL   NL 139       139 495  95 162  22   3 47
## 3 aaronha01   1954     1    ML1   NL 122       122 468  58 131  27   6 13
## 4 aaronha01   1965     1    ML1   NL 150       150 570 109 181  40   1 32
## 5 aaronha01   1960     1    ML1   NL 153       153 590 102 172  20  11 40
## 6 aaronha01   1955     1    ML1   NL 153       153 602 105 189  37   9 27
##   RBI SB CS BB SO IBB HBP SH SF GIDP G_old HR.mean HR.max HR.total
## 1 109 17  6 63 97  19   0  0  6   11   155   32.83     47      755
## 2 118  1  1 71 58  21   2  0  5    9   139   32.83     47      755
## 3  69  2  2 28 39  NA   3  6  4   13   122   32.83     47      755
## 4  89 24  4 60 81  10   1  0  8   15   150   32.83     47      755
## 5 126 16  7 60 63  13   2  0 12    8   153   32.83     47      755
## 6 106  3  1 49 61   5   3  7  4   20   153   32.83     47      755
##   career.length career.year
## 1            23          14
## 2            23          18
## 3            23           1
## 4            23          12
## 5            23           7
## 6            23           2

To get deeper information about top career players it'll be helpful to analyze the most recent players and Hall of Fame players from the past. To this end, we'll segregate players that started before 1940 from those that started after 1950. This requires tagging each player with the starting year. We'll add a column to the data.frame with this information as follows.

start_year_df <- ddply(Batting_hr_cy, .(playerID), function(df) min(df$yearID))
head(start_year_df)
##    playerID   V1
## 1 aaronha01 1954
## 2  aasedo01 1977
## 3 abbated01 1897
## 4 abbotgl01 1973
## 5 abbotji01 1989
## 6 abbotku01 1993
names(start_year_df)[2] <- "start.year"
# Merge this with other data.
Batting_hr_cy2 <- merge(Batting_hr_cy, start_year_df)

Now we select subframes of players with separate starting years.

Batting_early <- subset(Batting_hr_cy2, start.year < 1940)
Batting_late <- subset(Batting_hr_cy2, start.year > 1950)

Select players with most home runs in both groups

Let's get the top 10 home run hitters in the early group and the late group. We could order our Batting data.frames by total home runs, but we'd have lots of rows for each player (one for each year played). It's simpler to form a new dataframe that just has the total home runs per player.

tot_HR_early <- subset(Batting_early, select = c(playerID, HR.total))
head(tot_HR_early, n = 3)
##     playerID HR.total
## 37 abbated01       11
## 38 abbated01       11
## 39 abbated01       11
# Remove the duplicate rows:
tot_HR_early <- unique(tot_HR_early)
head(tot_HR_early, n = 3)
##      playerID HR.total
## 37  abbated01       11
## 172 adamsba01        3
## 206 adamssp01        9

Now we want to sort this data.frame by total homeruns in descending order. The plyr package provides a nice function to do this, called arrange.

tot_HR_early_srt <- arrange(tot_HR_early, desc(HR.total))
head(tot_HR_early_srt, n = 3)
##    playerID HR.total
## 1  ruthba01      714
## 2  foxxji01      534
## 3 willite01      521
top10_HR_hitters_early <- tot_HR_early_srt[1:10, "playerID"]

Now repeat this for the later players.

tot_HR_late <- subset(Batting_late, select = c(playerID, HR.total))
# Remove the duplicate rows:
tot_HR_late <- unique(tot_HR_late)
tot_HR_late_srt <- arrange(tot_HR_late, desc(HR.total))
head(tot_HR_late_srt, n = 3)
##    playerID HR.total
## 1 bondsba01      762
## 2 aaronha01      755
## 3  mayswi01      660
top10_HR_hitters_late <- tot_HR_late_srt[1:10, "playerID"]

Analyze the career power features for early players

First restrict the Batting data to these players.

Batting_early_top10 <- subset(Batting_early, playerID %in% top10_HR_hitters_early)

To get a sense of how home run performance varies over the career, let's just plot that data. To exclude any anomalies due to time injured and not playing we'll plot the ratio of home runs over at bats per year instead of the raw home run numbers.

To begin, plot this for willite01.

library(ggplot2)
ggplot(data = subset(Batting_early_top10, playerID == "willite01"), aes(x = career.year, 
    y = HR/AB)) + geom_point()

plot of chunk unnamed-chunk-15

How does Babe Ruth look?

ggplot(data = subset(Batting_early_top10, playerID == "ruthba01"), aes(x = career.year, 
    y = HR/AB)) + geom_point()

plot of chunk unnamed-chunk-16

We can get a view of the performance of all ten players using a facet plot.

ggplot(data = Batting_early_top10, aes(x = career.year, y = HR/AB)) + geom_point() + 
    facet_wrap(~playerID, ncol = 3)

plot of chunk unnamed-chunk-17

We can get a sense of the trends by adding a loess fit line to the panel.

ggplot(data = Batting_early_top10, aes(x = career.year, y = HR/AB)) + geom_point() + 
    facet_wrap(~playerID, ncol = 3) + geom_smooth()

plot of chunk unnamed-chunk-18

Analyze the career power features for late players

First restrict the Batting data to these players.

Batting_late_top10 <- subset(Batting_late, playerID %in% top10_HR_hitters_late)

Let's jump to the final facet plot.

ggplot(data = Batting_late_top10, aes(x = career.year, y = HR/AB)) + geom_point() + 
    facet_wrap(~playerID, ncol = 3) + geom_smooth()

plot of chunk unnamed-chunk-20

In most players their skills declined after about 10-12 years or stayed flat. Also, they tend to climb to a plateau early (7-8 years). Steep linear power growth for 10 years and late decline in skills is characteristic of players significantly associated with steroid use (Barry Bonds, Mark McGwyer, Sammy Sosa). We don't see the same picture in Alex Rodriguez, who did take steroids, so nothing is certain.