Many people have made use of the data offered by the Texas Department of Criminal Justice (TDCJ). The web is scattered with word clouds and inmate portraits derived from last statements. Considerable attention has also been paid to some of the more “colorful” last statements.
At the CSR, we have been working on taking a more rigorous approach to modeling the last statements (see Michael Clark’s excellent document for more information detailing this project). While the last statements are very interesting, they are far from the only interesting data provided by the TDCJ. In addition to our structured topic models, we have also taken an interest in mapping some of the data.
One interesting plot that we have made is the county of offense of the executed inmates. In addition to simply plotting the points, we also wanted to look at the race of each executed inmate. We can do this with pretty standard techniques in R.
We begin by scraping the data from the TDCJ website. When doing any scraping activities, I suggest that you always load XML and RCurl; if you are going to need one, you will probably need the other. Alternatively, you can use rvest and httr.
Grabbing the page is pretty easy. After that, we need to do some additional work to read the table within it.
library(rvest)
# html() is deprecated in current versions of rvest; read_html() replaces it
deathRow = read_html("http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
You can look at the original page and tell that we are dealing with a pretty nicely-formatted HTML table, so we will use the following code.
deathRow = html_table(deathRow)
deathRow = as.data.frame(deathRow)
deathRow = deathRow[-c(2:3)]
The use of magic numbers is not encouraged, but we are just dropping some inconsequential information from the initial HTML table.
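One way to avoid the positional index is to drop columns by name. This is only a minimal sketch on a toy data frame; the column names `Link` and `Link.1` are hypothetical stand-ins, not necessarily the headers of the TDCJ table, so check `names(deathRow)` before adapting it.

```r
# Toy table standing in for the scraped data (column names are assumptions)
toy = data.frame(Execution = c(3, 2, 1),
                 Link = c("a", "b", "c"),
                 Link.1 = c("x", "y", "z"),
                 Age = c(40, 35, 52),
                 stringsAsFactors = FALSE)

# Name the columns to drop, then keep everything else
dropCols = c("Link", "Link.1")
toyClean = toy[, !(names(toy) %in% dropCols)]

names(toyClean)
```

Dropping by name keeps working if the source table gains or reorders columns, which a positional index would not.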
deathRow$Age = as.numeric(as.character(deathRow$Age))
deathRow$Race = as.factor(deathRow$Race)
deathRow$County = as.factor(deathRow$County)
deathRow$Execution = as.numeric(as.character(deathRow$Execution))
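The `as.character()` step in the conversions above matters whenever a column has been read in as a factor. A small illustration with a made-up factor shows what happens without it:

```r
# A factor whose labels look like numbers (values are made up)
ageFactor = factor(c("30", "45", "27"))

# Direct conversion returns the internal level codes, not the ages
codes = as.numeric(ageFactor)
codes   # 2 3 1

# Going through character first recovers the original values
ages = as.numeric(as.character(ageFactor))
ages    # 30 45 27
```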
Let’s take a peek at the structure of the data and a summary table.
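The calls for this would be `str()` and `summary()`. The sketch below runs them on a toy data frame, since the real output depends on the scraped table; every value in it is made up for illustration.

```r
# Toy stand-in for the cleaned table; the values are made up
toy = data.frame(Execution = c(3, 2, 1),
                 Age = c(40, 35, 52),
                 Race = factor(c("White", "Black", "White")),
                 County = factor(c("Harris", "Dallas", "Harris")))

str(toy)      # column types and a preview of each column
summary(toy)  # numeric summaries and counts per factor level
```

With the real data, the same two calls would be made on `deathRow`.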
We are off to a great start, but we still have some work ahead of us.
Before we can map anything, we need to have something to map. We can probably agree that the counties would be a good start, but how are we going to go about doing that?
What we will do first is take the counties and paste some words onto them, specifically “county” and “texas”. We are doing this because we need to pass the geocode function something more specific than the bare county names (think about what would happen if you passed Deaf Smith, Dickens, Real, or Tom Green to the Google API). Without pasting these words onto the counties, we would probably get quite a few errors.
# The column is named County (capitalized); paste0() has no sep argument,
# so the suffix is passed as a single string
deathRow$countyState = paste0(deathRow$County, " county, texas")
That will give us one string per row of the form "Deaf Smith county, texas".
We could use the tolower function to make everything lowercase, but it really is not going to matter for what we are going to do (it is a good practice, though).
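As a quick illustration of both steps, here is a minimal sketch that builds the query strings for three of the county names mentioned above and then lowercases them. The object names `counties`, `query`, and `queryLower` are just for this example.

```r
# County names taken from the examples mentioned earlier
counties = c("Deaf Smith", "Dickens", "Real")

# paste0() simply concatenates its pieces, so the separator
# is written into the suffix itself
query = paste0(counties, " county, texas")

# Lowercasing is optional for geocoding but keeps the strings uniform
queryLower = tolower(query)
queryLower
```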
Now that we have county and state, we can use ggmap to get the coordinates for the county seat; we will use the returned latitude and longitude to plot our points.
library(ggmap)
getGeocodes = geocode(deathRow$countyState)
This will take some time to complete. Although we do not have any problems here, there is a limit to the number of requests you can make to the Google API.
## lon lat
## 1 -97.69823 30.20970
## 2 -95.35209 32.82962
## 3 -95.31025 29.77518
## 4 -95.64580 30.68154
## 5 -96.49298 33.17952
## 6 -97.52472 26.12850
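If the request limit ever becomes a problem, one option is to geocode each distinct county only once and then map the results back onto every row. The sketch below shows the idea offline: `fakeGeocode` is a hypothetical stand-in for `geocode` that returns placeholder coordinates, and the county strings are toy values.

```r
# Hypothetical stand-in for ggmap::geocode(); it returns placeholder
# coordinates so the sketch runs without network access
fakeGeocode = function(places) {
  data.frame(lon = -seq_along(places), lat = seq_along(places))
}

countyState = c("harris county, texas", "dallas county, texas",
                "harris county, texas", "harris county, texas")

# Geocode each distinct string once
uniquePlaces = unique(countyState)
uniqueCoords = fakeGeocode(uniquePlaces)

# Map the results back onto every row by matching on the query string
coords = uniqueCoords[match(countyState, uniquePlaces), ]
rownames(coords) = NULL

nrow(uniqueCoords)  # number of requests that would be made
coords
```

Because `match()` preserves the original row order, the result can still be bound to the data column-wise.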
After we get the geocodes, we will bind them to our existing data.
executedData = cbind(deathRow, getGeocodes)
We now have some data that we can plot!
Our next step is to get something onto which we will actually plot our points. We could get a map from R (e.g., ggmap or maps) or we could grab a shapefile from the web. Either is a viable solution, but we chose to go the shapefile route. You can get a shapefile from many different places (e.g., US Census Bureau’s TIGER/Line). We used the readOGR function from the rgdal package to pull the shapefile into R.
library(rgdal)
texShape0 = readOGR("Data_Files//Working//Other//StratMapv2_County_shp", "StratMapv2_County_poly")
This is a relative path, so it will not work for you as it is currently written. You will need to save the entire shapefile folder to your machine and point to it. In the example, take note that the first argument is the folder containing the shapefile and the second argument is the layer name, which is the name of the .shp file without its extension.
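A small sketch of how the two arguments relate, using the folder and layer names from the call above. The check at the end is an optional addition of ours, not part of the original workflow.

```r
# Build the folder path piece by piece; file.path() inserts single separators
shapeDir = file.path("Data_Files", "Working", "Other", "StratMapv2_County_shp")
layerName = "StratMapv2_County_poly"

# The layer corresponds to the .shp file inside that folder
shapeFile = file.path(shapeDir, paste0(layerName, ".shp"))

# Print a readable message if the file is not where we expect it
if (!file.exists(shapeFile)) {
  message("Shapefile not found at: ", shapeFile)
}
```

If the message appears, adjust `shapeDir` to wherever you saved the folder before calling `readOGR(shapeDir, layerName)`.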
Now that we have our shapefile, we need to do some additional work so that we can do some plotting with it. Since we will ultimately be using ggplot2 for our plot, we need to fortify it.
library(ggplot2)
# Store the row names of the attribute table as an id column
texShape0@data$id = rownames(texShape0@data)
texShape = fortify(texShape0)
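An `id` column like this is typically there so that the county attributes can be joined back onto the fortified points later. The toy data frames below are hypothetical stand-ins for `texShape0@data` and the output of `fortify`, and the sketch assumes the fortified `id` values correspond to the attribute table's row names.

```r
# Stand-in for texShape0@data: one row per county (names are made up)
attrs = data.frame(NAME = c("A", "B"), row.names = c("0", "1"),
                   stringsAsFactors = FALSE)
attrs$id = rownames(attrs)

# Stand-in for fortify() output: several boundary points per polygon id
pts = data.frame(long = c(1, 2, 3, 4), lat = c(5, 6, 7, 8),
                 id = c("0", "0", "1", "1"),
                 stringsAsFactors = FALSE)

# Join the attributes onto the points by the shared id
joined = merge(pts, attrs, by = "id")
joined
```

We do not need the attributes for the plot below, so this join is optional. If you do use it, note that `merge()` may reorder rows, so it is worth re-sorting by the `order` column that `fortify` returns before drawing polygons.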
Alright, everything is ready now! We needed to add some jittering; otherwise, the points would have been stacked on top of each other.
ggplot() +
  geom_polygon(data=texShape, aes(x=long, y=lat, group=group),
               color="black", fill="white") +
  # lowercase Race so that its values match the breaks given below
  geom_point(data=executedData, aes(x=lon, y=lat, color=tolower(Race)),
             alpha=.5, position=position_jitter(height=.08, width=.09), size=3.5) +
  scale_color_discrete(name="Inmate Race",
                       breaks=c("black","hispanic","white","other"),
                       labels=c("Black", "Hispanic", "White", "Other")) +
  ggtheme +
  labs(title="Inmates Executed in Texas Since 1982\n(County of Offense)") +
  theme(axis.ticks=element_blank(), axis.text.y=element_blank(),
        axis.text.x=element_blank(), axis.title.x=element_blank(),
        axis.title.y=element_blank(), plot.title=element_text(face="bold"))
The ggtheme function is a Michael Clark function that eliminates the gridlines and grey background; you can find it at his Github page.
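To see why the jittering in the plot call matters, here is a rough base R sketch of what it does: it adds a small uniform random offset to each coordinate. The coordinates are illustrative values, and the `runif` calls only approximate what `position_jitter` does internally.

```r
set.seed(42)  # make the random offsets reproducible

# Five points geocoded to the same county share one coordinate pair
stacked = data.frame(lon = rep(-95.3, 5), lat = rep(29.8, 5))

# Uniform noise within +/- width and +/- height, mirroring the values
# passed to position_jitter() in the plot
width = 0.09
height = 0.08
jittered = data.frame(
  lon = stacked$lon + runif(nrow(stacked), -width, width),
  lat = stacked$lat + runif(nrow(stacked), -height, height)
)

nrow(unique(stacked))   # a single distinct location before jittering
nrow(unique(jittered))  # distinct locations after jittering
```

The offsets are small relative to the size of a county, so the points stay close to the geocoded location while becoming individually visible.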
Let’s take a look at what we did.