Scraping Data

What Is Scraping?

Scraping is just helping a machine read data intended for humans.

It comes in a few forms:

Screen – reading what an application displays on screen

Report – pulling data out of generated documents and reports

Web – extracting data from web pages

Structure Types

Structured – typical data formats

Semi-structured – modern sites

Unstructured – varying levels of doom

Structured Data Formats

JSON

HTML/XML

CSV

JSON

{ 
  "reviewerID": "A2SUAM1J3GNN3B", 
  "asin": "0000013714", 
  "reviewerName": "J. McDonald", 
  "reviewText": "I bought this for my husband who plays the piano. 
                He is having a wonderful time playing these old hymns. 
                The music is at times hard to read because we think the 
                book was published for singing from more than playing from. 
                Great purchase though!", 
  "overall": 5.0
}
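Once a record like the one above is in hand, parsing it takes one call in Python's standard library. A minimal sketch (the reviewText is abridged to a single line so the string is valid JSON):

```python
import json

# A single Amazon-style review record, abridged from the example above
raw = '''{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "reviewText": "I bought this for my husband who plays the piano. Great purchase though!",
  "overall": 5.0
}'''

review = json.loads(raw)       # parse the JSON string into a dict
print(review["reviewerName"])  # -> J. McDonald
print(review["overall"])       # -> 5.0
```

Note that strict JSON forbids a trailing comma after the last field.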

API

An API will make life much easier for everyone.

Grabbing parts out of the HTML is sometimes necessary.

Sometimes, sites will give us an API.

Always check – don’t bet on it.

Handy APIs

Twitter

Glassdoor

BLS

U.S. Census

LinkedIn

Most modern sites have one (or something resembling one).
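An API call usually amounts to building a URL, fetching it, and parsing the JSON that comes back. The sketch below follows the general shape of the BLS public API, but treat the exact endpoint, series ID, and parameter names as assumptions to verify against the official documentation; the live fetch is commented out and a canned response is parsed instead:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen  # not called below; shown for the live fetch

# Hypothetical endpoint and parameters -- check the provider's docs
base = "https://api.bls.gov/publicAPI/v2/timeseries/data/"
series_id = "CUUR0000SA0"                      # a commonly cited CPI-U series
params = {"startyear": "2020", "endyear": "2021"}
url = base + series_id + "?" + urlencode(params)

# A live call would be: payload = json.loads(urlopen(url).read())
# Here we parse a canned response with the same general structure:
canned = '{"status": "REQUEST_SUCCEEDED", "Results": {"series": []}}'
payload = json.loads(canned)

print(url)
print(payload["status"])
```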

API Limits

Many (if not most) APIs limit your queries.

APIs usually do not give you all of the data that you want either.

This is where a combination approach becomes necessary.
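One standard way to live within a query limit is to retry with exponential backoff. A minimal sketch, where fake_api is a stand-in for a real API call that returns rate-limit errors:

```python
import time

def fetch_with_backoff(fetch, max_tries=4, base_delay=1.0):
    """Call fetch(); on a rate-limit error, wait and retry with doubling delays."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except RuntimeError:                       # stand-in for an HTTP 429
            if attempt == max_tries - 1:
                raise                              # out of retries; give up
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Demo: fail twice (simulated 429s), then succeed on the third call
calls = {"n": 0}
def fake_api():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return {"data": "ok"}

result = fetch_with_backoff(fake_api, base_delay=0.01)
print(result)   # {'data': 'ok'} after two retries
```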

Easing In

Tables are the easiest thing to scrape.

Wikipedia

Consumer Price Index

Pretty Easy…Right?

Python Style

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_banks_in_the_United_States"

banklist = pd.read_html(url)[0]  # first table on the page

Pretty Easy Again?

R Style

library(rvest)     # read_html(), html_nodes(), html_table()
library(magrittr)  # extract2()

url = "https://en.wikipedia.org/wiki/List_of_largest_banks_in_the_United_States"

bankList = read_html(url) %>% # Read the html
  html_nodes("table") %>% # Grab "table" nodes
  extract2(1) %>% # Extract the first table
  html_table() # Save the table as a data frame

Data Ready To Go

Learning Some New Languages

CSS selectors and XPath make scraping certain objects easier.

The CSS Diner is a favorite!
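Python's standard library supports a small XPath subset through xml.etree.ElementTree, which is enough to show the idea on a toy, well-formed snippet (real pages are messier and usually call for a tolerant parser such as BeautifulSoup in Python or rvest in R):

```python
import xml.etree.ElementTree as ET

# A toy, well-formed fragment standing in for part of a page
snippet = """
<div>
  <div class="review"><p>Great coffee.</p></div>
  <div class="review"><p>Slow service.</p></div>
  <div class="ad"><p>Buy now!</p></div>
</div>
"""

root = ET.fromstring(snippet)
# XPath subset: every <p> inside a div whose class attribute is "review"
reviews = [p.text for p in root.findall(".//div[@class='review']/p")]
print(reviews)   # ['Great coffee.', 'Slow service.']
```

The equivalent CSS selector would be div.review p, as used in rvest's html_nodes().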

Looking Under The Hood

(Maybe) Meeting A New Friend

No matter your browser, it will have an Inspect tool.

This tool will quickly become your friend.

Easier Still…

Excel Web Query is handy for grabbing tables.

It can even be refreshed.

Did You Bring A Machine?

If so, let’s give Excel a quick try.

We should be able to do it in under two minutes.

Semi and Unstructured Data

Whether there is 1 or 1,000 of them, tables are pretty easy to scrape.

But…not everything comes in tabular form.

This is where things become fun.

Requiring Some Thought

Most modern sites are pretty well constructed.

Yelp Reviews

Reviews

Still Nothing Too Hard

library(rvest)  # read_html(), html_nodes(), html_attr(); also provides %>%

url = "https://www.yelp.com/biz/capri-granger"

yelpHTML = read_html(url)

# Note: these selectors match Yelp's markup at the time; class names change
ratings = yelpHTML %>% 
  html_nodes(".review-wrapper .review-content .i-stars") %>% 
  html_attr("title") %>% 
  stringr::str_extract("[0-5]")

reviews = yelpHTML %>% 
  html_nodes(".review-wrapper .review-content p") %>% 
  html_text()

Scraping Services

For these easy tables, we might be able to use a service.

Many free ones are available.

import.io

dexi.io

Getting Messy

Let’s play a little game.

MCoB Directory

Regular Expression

Pattern             Matches
Ph.? ?D             Ph.D, Ph. D, Ph D, PhD
\d{3}-\d{3}-\d{4}   Phone numbers

RegExr
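Both patterns from the table can be tried directly in Python's re module. A quick sketch on made-up text (note the unescaped dot in Ph.? matches any character, which is usually fine here; Ph\.? would be stricter):

```python
import re

text = "Contact J. Smith, Ph.D (or Ph D) at 574-631-5000 or 555-123-4567."

# Degree variants: "Ph", optional any character, optional space, "D"
degrees = re.findall(r"Ph.? ?D", text)
print(degrees)   # ['Ph.D', 'Ph D']

# U.S.-style phone numbers: three digits, dash, three digits, dash, four digits
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(phones)    # ['574-631-5000', '555-123-4567']
```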

Dining Hall Fun

What’s good today?

A Note Of Caution

Scraping can get many things.

However, it is not magic.

We can’t scrape all of Google.

The laws of physics cannot be bent.

Some sites explicitly prohibit it.