Many people have made use of the data offered by the Texas Department of Criminal Justice (TDCJ). The web is scattered with word clouds and inmate portraits derived from last statements. Considerable attention has also been paid to some of the more “colorful” last statements.

At the CSR, we have been working on taking a more rigorous approach to modeling the last statements (see Michael Clark’s excellent document for more information detailing this project). While the last statements are very interesting, they are far from the only interesting data provided by the TDCJ. In addition to our structured topic models, we have also taken an interest in mapping some of the data.

One interesting plot that we have made shows the county of offense for each executed inmate. In addition to simply plotting the points, we also wanted to show the race of each executed inmate. We can do this with pretty standard techniques in R.


Getting the Data


We begin by scraping the data from the TDCJ website. When doing any scraping activities, I suggest that you always load XML and RCurl; if you are going to need one, you will probably need the other. Alternatively, you can use rvest and httr, and rvest is what we use here.

Grabbing the page at the url is pretty easy. After that, we need to do some additional work to read the table within it.

library(rvest)

# read_html() replaces html(), which is deprecated in newer versions of rvest
deathRow = read_html("http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")

You can look at the original page and tell that we are dealing with a pretty nicely-formatted HTML table, so we will use the following code.

deathRow = html_table(deathRow)

deathRow = as.data.frame(deathRow)

deathRow = deathRow[-c(2:3)]

The use of magic numbers is not encouraged, but we are just dropping some inconsequential information from the initial HTML table.

deathRow$Age = as.numeric(as.character(deathRow$Age))
deathRow$Race = as.factor(deathRow$Race)
deathRow$County = as.factor(deathRow$County)
deathRow$Execution = as.numeric(as.character(deathRow$Execution))

Let’s take a peek at the structure of the data and a summary table.
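The output of that peek is not reproduced here. As a minimal sketch of the conversion-and-inspection step, assuming a small made-up data frame (the rows below are hypothetical, not TDCJ records), the following shows why as.character comes before as.numeric when a column has been read in as a factor, and how str and summary can be used to check the result.

```r
# Hypothetical toy rows standing in for the scraped table (not real TDCJ data)
toy = data.frame(Age = factor(c("38", "45", "29")),
                 Race = c("White", "Black", "Hispanic"),
                 stringsAsFactors = FALSE)

# as.numeric() on a factor returns the internal level codes, not the values
codes = as.numeric(toy$Age)

# Converting to character first recovers the numbers as they were written
toy$Age = as.numeric(as.character(toy$Age))
toy$Race = as.factor(toy$Race)

str(toy)      # column types after conversion
summary(toy)  # numeric summary for Age, counts per level for Race
```

If a column was read in as character rather than factor, the as.character step is harmless, which is one reason to keep it in.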

We are off to a great start, but we still have some work ahead of us.


Preparing the Data for Geocoding


Before we can map anything, we need to have something to map. We can probably agree that the counties would be a good start, but how are we going to go about doing that?

What we will do first is take the counties and paste some words onto them, specifically “county” and “texas”. We are doing this because we need to pass the geocode function something more specific than the bare county names (think about what would happen if you passed Deaf Smith, Dickens, Real, Tom Green, or Jim Hogg to the Google API). Without these words, we would probably get quite a few errors or mismatched locations.

deathRow$countyState = paste0(deathRow$County, " county, texas")

That gives each row a string made up of the county name followed by " county, texas" (for example, "Harris county, texas").

We could use the tolower function to make everything lowercase, but it really is not going to matter for what we are going to do (it is a good practice, though).
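As a minimal sketch of the string-building step, assuming a short vector of county names chosen purely for illustration, the following shows how paste0 and tolower combine. Note that paste0 has no sep argument, so the spacing and punctuation have to be part of the strings being joined.

```r
# A few Texas county names used purely for illustration
counties = c("Deaf Smith", "Dickens", "Real")

# paste0() joins its arguments with no separator, so the space and the
# comma go directly into the string being appended
queries = paste0(counties, " county, texas")

# tolower() normalizes case; optional for geocoding, but tidy
queries = tolower(queries)

print(queries)
```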

Now that we have county and state, we can use ggmap to get the coordinates for the county seat; we will use the returned latitude and longitude to plot our points.

library(ggmap)
# Recent versions of ggmap need a Google API key, set with register_google()
getGeocodes = geocode(deathRow$countyState)

This will take some time to complete. Although we do not have any problems here, there is a limit to the number of requests you can make to the Google API.

##         lon      lat
## 1 -97.69823 30.20970
## 2 -95.35209 32.82962
## 3 -95.31025 29.77518
## 4 -95.64580 30.68154
## 5 -96.49298 33.17952
## 6 -97.52472 26.12850

After we get the geocodes, we will bind them to our existing data.

executedData = cbind(deathRow, getGeocodes)
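Because many inmates share a county of offense, one way to stay under the request limit is to geocode each distinct county once and then copy the coordinates back to every row. The following is a sketch of that idea under stated assumptions, not the approach used above: the data are made up, and fakeGeocode is a hypothetical stand-in for geocode that returns invented coordinates so the example runs offline.

```r
# Toy data: several rows share a county (values are illustrative only)
toy = data.frame(countyState = c("harris county, texas", "dallas county, texas",
                                 "harris county, texas", "bexar county, texas",
                                 "harris county, texas"),
                 stringsAsFactors = FALSE)

# Hypothetical stand-in for ggmap::geocode(); the coordinates are invented
fakeGeocode = function(x) {
  data.frame(lon = -100 + seq_along(x), lat = 30 + seq_along(x))
}

# Look up each distinct county once instead of once per row
uniqueCounties = unique(toy$countyState)
coords = fakeGeocode(uniqueCounties)

# match() gives, for every row, the position of its county in the lookup
idx = match(toy$countyState, uniqueCounties)
toyGeo = cbind(toy, coords[idx, ])
rownames(toyGeo) = NULL

print(toyGeo)
```

Here five rows need only three lookups; with real data, swapping fakeGeocode for geocode would cut the number of requests to the number of distinct counties.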

We now have some data that we can plot!


Getting a Map


Our next step is to get something on to which we will actually plot our points. We could get a map from R (e.g., ggmap or maps) or we could grab a shapefile from the web. Either is a viable solution, but we chose to go the shapefile route. You can get a shapefile from many different places (e.g., US Census Bureau’s TIGER/Line). We used the readOGR function from the rgdal package to pull the shapefile into R.

library(rgdal)
texShape0 = readOGR("Data_Files//Working//Other//StratMapv2_County_shp", "StratMapv2_County_poly")

This is a relative path, so it will not work for you as it is currently written. You will need to save the entire shapefile folder to your machine and point to it. In the example, note that the first argument is the folder containing the shapefile and the second argument is the layer name, i.e., the name of the .shp file within that folder without its extension.
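One way to make the path problem easier to spot is to build the path explicitly and check that the file is there before reading it. This is a minimal sketch that reuses the folder and layer names from the example above; the folder location is an assumption about your own machine and will need to be changed.

```r
# Folder and layer names from the example; adjust shapeDir to your own setup
shapeDir = file.path("Data_Files", "Working", "Other", "StratMapv2_County_shp")
layerName = "StratMapv2_County_poly"  # the .shp file name without its extension

# Check that the expected .shp file exists before calling readOGR()
shpFile = file.path(shapeDir, paste0(layerName, ".shp"))
if (file.exists(shpFile)) {
  message("Found shapefile: ", shpFile)
} else {
  message("Shapefile not found; check the path relative to ", getwd())
}
```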

Now that we have our shapefile, we need to do some additional work so that we can do some plotting with it. Since we will ultimately be using ggplot2 for our plot, we need to fortify it.

library(ggplot2)
texShape0@data$id = rownames(texShape0@data)
texShape = fortify(texShape0)

Plotting the Data on our Map


Alright, everything is ready now! We needed to add some jittering; otherwise, points from the same county would have been stacked on top of each other.

ggplot() + 
  geom_polygon(data=texShape, aes(x=long, y=lat, group=group), 
               color="black", fill="white") +
  # Race is capitalized in the scraped data; the breaks below must match
  # the values stored in Race exactly, including case
  geom_point(data=executedData, aes(x=lon, y=lat, color=Race), 
             alpha=.5, position=position_jitter(height=.08, width=.09), size=3.5) +
  scale_color_discrete(name="Inmate Race",
                       breaks=c("black","hispanic","white","other"),
                       labels=c("Black", "Hispanic", "White", "Other")) +
  ggtheme + 
  labs(title="Inmates Executed in Texas Since 1982\n(County of Offense)") +
  theme(axis.ticks=element_blank(), axis.text.y=element_blank(), axis.text.x=element_blank(), axis.title.x=element_blank(),
        axis.title.y=element_blank(), plot.title=element_text(face="bold"))

The ggtheme function is a Michael Clark function that eliminates the gridlines and grey background; you can find it at his Github page.
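To see what the jittering in the plot is doing, here is a base R approximation on made-up points. It is a sketch rather than ggplot2's own code: it assumes that position_jitter shifts each point by a uniform random amount of up to the given width and height in either direction, and mimics that with runif.

```r
set.seed(42)  # make the random offsets reproducible

# Five made-up points that share one location, as happens when many
# inmates have the same county of offense
pts = data.frame(lon = rep(-95.3, 5), lat = rep(29.8, 5))

# Shift each point by a random amount in [-w, w] horizontally and
# [-h, h] vertically, using the same amounts as the plot above
w = 0.09; h = 0.08
jittered = data.frame(lon = pts$lon + runif(nrow(pts), -w, w),
                      lat = pts$lat + runif(nrow(pts), -h, h))

anyDuplicated(pts)       # greater than 0: the original points overlap exactly
anyDuplicated(jittered)  # expected to be 0: each point has its own position
```

The offsets are small relative to the size of a Texas county, so the points stay near their county while becoming individually visible.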

Let’s take a look at what we did.

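[Figure: map of inmates executed in Texas since 1982, by county of offense, with points colored by inmate race]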

Since this is an svg file, you can easily zoom in to see things a bit more clearly. We could also do other things with the data (e.g., look at ages), but this is a good start! We can see that more of the executed inmates committed their offenses in the most populous counties (e.g., Harris, Bexar, Dallas). Admittedly, this might not be the most telling map ever created, but it is interesting to see and we are taking a solid crack at making more use of the data that is out there. We have also been working on creating this map with ggvis. Although it is in the alpha stage, we can do neat things such as plotting the points year by year.

We could also use the leaflet package.

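# Load the saved workspace; departedData is assumed to be stored in this file
load("~/R/texasexecutions/Data_Files/Working/RData/texasExecutions.RData")
library(maps); library(leaflet)
# Get the Texas county outlines without drawing them
txmap = map('county', 'Texas', plot=FALSE)
leaflet(txmap) %>%
  addPolylines(weight=.5, color='black') %>%
  addCircles(lng=departedData$lon, lat=departedData$lat)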