Introduction

Geocoded data can come from many different sources. Data from surveys and mobile devices often include geocodes, and geocodes can also be obtained through the census. However, we do not need to rely on survey or census data to get geocoded data. Instead, we can use the geographic features within any dataset (e.g., state, city, or even address if it is available) to produce geocodes at that same level. For the current example, the institutions from which ND social science faculty received their Ph.D.s were scraped from the web, and those institutions were then geocoded.

Scraping

Although it is slightly outside the scope of the current document, the importance of scraping cannot be glossed over. Common data sources will never go away (e.g., GSS, Census), but it is short-sighted to limit ourselves to that data. With the ability to scrape data from the web, nothing is really off limits. The ability to scrape data, coupled with a good programming language, gives us the power to collect and aggregate anything we can find on the web. Whether it is YouTube comments, daily stock prices, Zillow listings, or tables from Wikipedia, we can grab anything. Although scraping is not difficult, we are often hampered by poor web design; so while many tasks are easy, others can be quite complicated.

As an example, here is all it took to get the Anthropology faculty into a usable form:

library(rvest)
library(stringr)
library(dplyr)

# Pull every paragraph from the faculty page, keep the "PhD ..." lines,
# and clean them into a single lower-cased institution column.
anthro = read_html("http://anthropology.nd.edu/faculty-and-staff/") %>%
  html_nodes("p") %>%
  html_text() %>%
  str_extract("PhD .*") %>%
  na.omit() %>%
  str_replace("PhD ", "") %>%
  str_replace(",", "") %>%
  str_replace("-", " ") %>%
  tolower() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  mutate(department = "Anthropology")

Although this represents the easiest example for the departments, it does give a pretty good representation of just how easy it can be to obtain unstructured data from the web.

The Data

In obtaining the geocodes for the universities, we are left with a dataset containing the Ph.D. institution (university), department, longitude (lon), and latitude (lat).
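As a sketch of that geocoding step (the data frame `faculty` and its `university` column are illustrative names, not the actual objects used), ggmap's geocode() turns a place-name string into a longitude/latitude pair; note that newer versions of ggmap require registering an API key before geocoding.

```r
library(ggmap)

# One lookup per unique institution, then join the coordinates back on.
universities = unique(faculty$university)
coords = geocode(universities)  # returns a data frame with lon and lat

lookup = data.frame(university = universities, coords)
phdData = merge(faculty, lookup, by = "university")
```

Geocoding each unique institution once, rather than each faculty row, keeps the number of service calls small.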

One logical product of this data is a map. Although not strictly a requisite, maps are generally the first part of a GIS (after the data itself). A map can serve many practical functions (e.g., showing population densities for survey sampling) and is a wonderful descriptive tool. As useful as maps are for descriptive and practical purposes, we should also use our data to answer a research question. For the current example, we are going to find out how the Ph.D. institutions of ND social science faculty are distributed in space.

Exploratory Mapping

Before diving into the meaty parts, we do need to explore the data a bit. The following interactive map is useful as a descriptive aid. It features frequencies for states, universities, and departments, in addition to showing university locations sized by frequency. Please do note that we have dropped the non-US universities (a few in Toronto and one in Amsterdam).

Point Pattern Analysis

Although the interactive map is nice for descriptives and is fun to click around on, we can take some different approaches to learning about how the universities are distributed. One way we can do this is with point pattern analysis (PPA). PPA comes in many different forms, but we are ultimately going to test whether our data exhibits complete spatial randomness (CSR) or is distributed in some other way (uniform/regular or clustered). In PPA, CSR serves as the null hypothesis.

In PPA, we are not looking at the frequency with which a point occurs. Instead, we are only looking for a point’s presence and where it is in space compared to other points. The figure below contains all of the universities within our sample (the grey dots), the geographic mean of the universities, and an ellipse that captures the standard distance deviation of the universities. We have also done this for each individual department. The ellipses are largely stacked on top of each other (save for Anthropology), but they give us a good idea of how the points are distributed (the horizontal ovals match the shape of our data pretty nicely). If our points were distributed in a different manner, the “deviation ellipses” would give us an idea of shape and direction.

[Figure: point pattern of Ph.D. institutions, with mean centers and standard deviational ellipses]

You should notice that the mean locations are different from those in the interactive map. Recall that PPA looks only at the presence of a point, not the number of points at a location. We are only concerned that Harvard has a point, not that there are 8 points there. From the PPA perspective, this makes intuitive sense: things are rarely stacked directly on top of each other in nature (single-family homes, trees, beaver dams), so PPA does not deal with stacked points.
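The mean center and standard distance deviation can be computed directly; `phdData` here is an assumed data frame with one row per unique university-department pair (stacked points already dropped) and lon/lat columns.

```r
library(dplyr)

# Geographic mean of the pooled points.
meanCenter = summarize(phdData, lon = mean(lon), lat = mean(lat))

# Standard distance deviation: the RMS distance of points from the center.
sdd = sqrt(mean((phdData$lon - meanCenter$lon)^2 +
                (phdData$lat - meanCenter$lat)^2))

# Per-department mean centers come from a grouped summary.
deptCenters = phdData %>%
  group_by(department) %>%
  summarize(lon = mean(lon), lat = mean(lat))
```

The deviational ellipses themselves can come from a dedicated package (e.g., aspace's calc_sde()), since they also account for the directional spread of the points.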

Intensity

We can also take a look at local intensity by using kernel intensity estimation. Kernel intensity estimation allows us to determine where in our space points are most likely to occur. In this particular example, we used likelihood cross-validation bandwidth (i.e., radius) selection. We could have used a number of different bandwidth methods (e.g., Diggle, Stoyan), but this method works well with our data (naturally); ideally, we would have selected a method based more on theory, but this is an exploratory process.

In selecting a method, we essentially have a choice between capturing highly-local events very well or viewing larger smoothed areas without the apparent presence of local instances. Methods like Diggle and Stoyan pick up on all of the universities better because they are calculating intensity with a smaller radius; however, these offer very little more than the point pattern plot for the data in our example. In the current example, we use a leave-one-out intensity calculation; in other words it does not assess the kernel intensity contribution of a point in calculating its own intensity.

In the following figure, we can see that the areas with the greatest intensities are clearly hitting where we have the most universities.
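A sketch of the intensity estimation with spatstat follows; `phdData` (one row per unique US university) and the rectangular window bounds are illustrative assumptions, since a real analysis would use a proper US boundary polygon.

```r
library(spatstat)

# Build a point pattern; the window roughly covers the continental US.
pts = ppp(x = phdData$lon, y = phdData$lat,
          window = owin(c(-125, -67), c(25, 50)))
pts = unique(pts)  # PPA ignores stacked points

# Likelihood cross-validation bandwidth (bw.ppl), then the surface.
# bw.diggle() or bw.stoyan() would give the smaller-radius alternatives.
surface = density(pts, sigma = bw.ppl(pts))
plot(surface, main = "Kernel intensity of Ph.D. institutions")
```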

Ripley’s K Function

All of the descriptive stuff for the PPA is great, but we need to know how our point pattern is distributed. There are several different nearest neighbor analyses of spatial dependence from which we may choose (e.g., F, G), but we are going to use Ripley’s K function, \(K(r) = \lambda^{-1}\, E[\text{number of additional points within distance } r \text{ of a typical point}]\), for obvious reasons (less obviously, it handles more points and uses different scale lengths for estimation). The K function is not exactly a nearest neighbor function in a strict sense, but is often grouped with them.
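A sketch of the K function estimation with spatstat is below; as before, `phdData` and the window bounds are assumptions for illustration.

```r
library(spatstat)

# Point pattern of the unique US universities (stacked points dropped).
pts = unique(ppp(x = phdData$lon, y = phdData$lat,
                 window = owin(c(-125, -67), c(25, 50))))

# Kest() estimates K(r); envelope() simulates CSR patterns (nsim of them)
# to build the band that the observed line is judged against.
csrEnvelope = envelope(pts, Kest, nsim = 99)
plot(csrEnvelope, main = "Ripley's K with CSR envelope")
```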

Examining the plot tells us several things. The most obvious is that our data (represented by the black line) is above the CSR line. This would indicate that our point pattern is clustered. The next bit of information we get from this plot is that our data is not within the CSR envelope; this can be interpreted as “significant” clustering. Finally, we can see that the clustering increases as our distance measure unit (r) increases. In looking at our point pattern, this is all readily apparent.

Next Steps

It would be helpful to have a reason as to why this is or is not important. It is clear that there is a great deal of clustering, but does it have any impact on anything important? One worthwhile inquiry would be to look at “multivariate” point patterns, which would allow us to compare the point patterns of departments against each other. As an example, we could compare the distribution of programs at Notre Dame to top programs at other institutions. We might hypothesize that highly-rated programs show even greater clustering than the equivalent program at Notre Dame. A brief examination of Notre Dame’s and Stanford’s Political Science departments lends preliminary support to this idea (Stanford’s mean point is farther north and east, and it has greater intensity in the New England coastal area; a document is in process).

In addition to comparing the spatial distribution of departments, we could also use demographic variables as covariates in a model. However, this data would require a great deal of manual work. Perhaps one of the more interesting avenues we could explore would be how/if the distribution has changed over time. Although this data might be difficult to acquire, it would allow us to see how departmental hiring trends have changed over time.