Importing data of different sizes, shapes and sources

Data can be found in many different forms and locations. It may be in a spreadsheet we get as a file from some source, or it may be provided over an HTTP connection by some web service. It may be small (~1,000 records) or very large (2 billion records). We'll give examples of how to handle multiple sources and forms of data.

setwd("~/Documents/Computing with Data/23_data_import/")

Connections

R establishes a connection with a file or database through an object called a connection. The connection type can tell R whether the file is text or binary, compressed or uncompressed, etc. Creating a formal R object as a connection has advantages over always referencing a path to the file.

Basic file connections

A connection to a file can be created with the file command. It can be opened for reading, writing, or both. When writing to a file there is an option to append output to the end of the file. When done, you should explicitly close the connection.

file1 <- file("test_file1.txt", "w")  # Opens the file for writing
cat("New line", "second part of new output", sep = "\n", file = file1)
close(file1)
cat("", "third part of new output", sep = "\n", file = "test_file1.txt", append = TRUE)

It's possible to redirect output from the console to a text file by opening a connection and using the sink command. Another type of connection is a pipe: in Unix shell scripting, a pipe passes the output of one command to another command as input.

Reading files

The function readLines applied to a file or a connection reads the entire file and returns the lines as a character vector. More generally, scan reads a file and splits it on a separator of your choice. You can specify the maximum number of items to read (nmax).

textLines <- readLines(con = "../22_parallel/parallel_foreach.Rmd")
## Warning: incomplete final line found on
## '../22_parallel/parallel_foreach.Rmd'
textLines[1:5]
## [1] "Parallel computing, error catching and profiling"                                                                                                                                                                                                      
## [2] "============"                                                                                                                                                                                                                                          
## [3] ""                                                                                                                                                                                                                                                      
## [4] "This is a real-world example in which parallelization will give a speed boost. Along the way, we'll show how to handle a situation in which an error occurs in one hard-to-find iteration of loop. We'll also discuss profiling **R** code for speed. "
## [5] ""
textBlob <- scan(file = "../22_parallel/parallel_foreach.Rmd", what = "character")
textBlob[1:5]
## [1] "Parallel"   "computing," "error"      "catching"   "and"

Compressed files

R can open and read compressed files without uncompressing them outside of R.

con_gz <- gzfile(description = "./GSM107384.CEL.gz")
read2 <- readLines(con_gz, n = 10)  # decompresses on the fly
close(con_gz)

URLs

Connections can also be made to files identified over the internet by URLs ("http://", "ftp://", etc.) using the function url. For convenience, many functions that expect a file path as input also understand URLs.

con_int <- file("http://www3.nd.edu/~steve/computing_with_data/index.html")
readLines(con_int)[1:4]
## Warning: incomplete final line found on
## 'http://www3.nd.edu/~steve/computing_with_data/index.html'
## [1] "<!DOCTYPE html>"                                                          
## [2] "<html>"                                                                   
## [3] "<head>"                                                                   
## [4] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"

Binary files

These can be read as well, although the contents may be hard to process unless the file has a well-defined format.
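For instance, readBin reads raw bytes (or other low-level types) from a binary connection; here is a minimal sketch using the compressed CEL file from above (it reads the first bytes of the compressed file itself, without decompressing):

con_bin <- file("./GSM107384.CEL.gz", "rb")            # open in binary ("rb") mode
first_bytes <- readBin(con_bin, what = "raw", n = 20)  # first 20 bytes as raw values
close(con_bin)
first_bytes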

Connections to databases

Connections to databases are very useful for statistical analysis of data. There are packages for connecting to databases such as MySQL, SQLite, Oracle, etc. Once a connection is made, queries can be written within R in SQL and records extracted accordingly.

Here is an example of how it would work:

library(RMySQL)
require(ETLUtils)

login <- list()
login$user <- "genome"
login$host <- "genome-mysql.cse.ucsc.edu"

x <- read.dbi.ffdf(
    query = "select * from mm9.knownGene",
    dbConnect.args = list(drv = dbDriver("MySQL"),
                          user = login$user, host = login$host),
    VERBOSE = TRUE)

Often you'll want to select just a few records (or a few columns) rather than entire tables, because of the size of these databases.
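For instance, here is a sketch of a restricted query against the same UCSC server using the DBI functions directly (the column names and the LIMIT clause are just for illustration):

library(RMySQL)

# Connect to the public UCSC genome MySQL server used above
con_db <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")

# Pull a few columns of a few rows instead of the whole table
few_genes <- dbGetQuery(con_db,
    "SELECT name, chrom, txStart, txEnd FROM mm9.knownGene LIMIT 5")
dbDisconnect(con_db)
few_genes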

Text connections

Text connections are another source of input. They allow R character vectors to be read as if the lines were being read from a text file. A text connection is created and opened by a call to textConnection, which copies the current contents of the character vector to an internal buffer at the time of creation.

Text connections can also be used to capture R output to a character vector. textConnection can be asked to create a new character object or append to an existing one, in both cases in the user’s workspace. The connection is opened by the call to textConnection, and at all times the complete lines output to the connection are available in the R object. Closing the connection writes any remaining output to a final element of the character vector.

zz <- textConnection("This is a text connection")
scan(file = zz, what = "character")
## [1] "This"       "is"         "a"          "text"       "connection"

Getting data from large files

A limitation of base R when accessing information in a file is that you generally read the whole file into memory before processing it. You can scan it in chunks, or in blocks of increasing size, but then it's hard to be sure you've processed everything. If you only want a subset of the file, this is very frustrating.

In languages like Perl and Python there are commands that allow you to read a file one line at a time. The algorithm looks like this:

for r ranging over the lines in the file:
    test if r satisfies a condition
    if yes, apply the function f to r and store the output
end

In this way you only read one line at a time, keeping the memory requirements small.
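Base R can mimic this with readLines(con, n = 1) on an open connection inside a loop; here is a minimal sketch using the small file created earlier (the condition and the processing step stand in for the test and the function f):

con <- file("test_file1.txt", "r")
results <- list()
repeat {
    r <- readLines(con, n = 1)
    if (length(r) == 0) break                          # end of file
    if (nchar(r) > 0) {                                # test if r satisfies a condition
        results[[length(results) + 1]] <- toupper(r)   # apply f and store the output
    }
}
close(con)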

This reads like an iterator, and that's what we can use in R to process files a line at a time. The iterators package provides such an iterator.

We have stored on disk a tab-delimited file containing 9 million records from Medicare providers, concerning billing and treatment counts. It totals 1.79 GB. It could be read into memory (perhaps eventually), but suppose we just want data on Indiana providers.

As a start, let's scan in enough of the file to get a sense of how it's organized.

samp_med <- scan(file = "../Data/Medicare-Physician-and-Other-Supplier-PUF-CY2012/Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt", 
    what = "character", nmax = 100)
samp_med
##   [1] "npi"                              "nppes_provider_last_org_name"    
##   [3] "nppes_provider_first_name"        "nppes_provider_mi"               
##   [5] "nppes_credentials"                "nppes_provider_gender"           
##   [7] "nppes_entity_code"                "nppes_provider_street1"          
##   [9] "nppes_provider_street2"           "nppes_provider_city"             
##  [11] "nppes_provider_zip"               "nppes_provider_state"            
##  [13] "nppes_provider_country"           "provider_type"                   
##  [15] "medicare_participation_indicator" "place_of_service"                
##  [17] "hcpcs_code"                       "hcpcs_description"               
##  [19] "line_srvc_cnt"                    "bene_unique_cnt"                 
##  [21] "bene_day_srvc_cnt"                "average_Medicare_allowed_amt"    
##  [23] "stdev_Medicare_allowed_amt"       "average_submitted_chrg_amt"      
##  [25] "stdev_submitted_chrg_amt"         "average_Medicare_payment_amt"    
##  [27] "stdev_Medicare_payment_amt"       "0000000001"                      
##  [29] "CPT"                              "copyright"                       
##  [31] "2011"                             "American"                        
##  [33] "Medical"                          "Association."                    
##  [35] "All"                              "Rights"                          
##  [37] "Reserved."                        "1003000126"                      
##  [39] "ENKESHAFI"                        "ARDALAN"                         
##  [41] "M.D."                             "M"                               
##  [43] "I"                                "900"                             
##  [45] "SETON"                            "DR"                              
##  [47] "CUMBERLAND"                       "215021854"                       
##  [49] "MD"                               "US"                              
##  [51] "Internal"                         "Medicine"                        
##  [53] "Y"                                "F"                               
##  [55] "99222"                            "Initial"                         
##  [57] "hospital"                         "care"                            
##  [59] "115"                              "112"                             
##  [61] "115"                              "135.25"                          
##  [63] "0"                                "199"                             
##  [65] "0"                                "108.11565217"                    
##  [67] "0.9005883395"                     "1003000126"                      
##  [69] "ENKESHAFI"                        "ARDALAN"                         
##  [71] "M.D."                             "M"                               
##  [73] "I"                                "900"                             
##  [75] "SETON"                            "DR"                              
##  [77] "CUMBERLAND"                       "215021854"                       
##  [79] "MD"                               "US"                              
##  [81] "Internal"                         "Medicine"                        
##  [83] "Y"                                "F"                               
##  [85] "99223"                            "Initial"                         
##  [87] "hospital"                         "care"                            
##  [89] "93"                               "88"                              
##  [91] "93"                               "198.59"                          
##  [93] "0"                                "291"                             
##  [95] "9.5916630466"                     "158.87"                          
##  [97] "0"                                "1003000126"                      
##  [99] "ENKESHAFI"                        "ARDALAN"
samp_med2 <- readLines(con = "../Data/Medicare-Physician-and-Other-Supplier-PUF-CY2012/Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt", 
    n = 3)
samp_med2
## [1] "npi\tnppes_provider_last_org_name\tnppes_provider_first_name\tnppes_provider_mi\tnppes_credentials\tnppes_provider_gender\tnppes_entity_code\tnppes_provider_street1\tnppes_provider_street2\tnppes_provider_city\tnppes_provider_zip\tnppes_provider_state\tnppes_provider_country\tprovider_type\tmedicare_participation_indicator\tplace_of_service\thcpcs_code\thcpcs_description\tline_srvc_cnt\tbene_unique_cnt\tbene_day_srvc_cnt\taverage_Medicare_allowed_amt\tstdev_Medicare_allowed_amt\taverage_submitted_chrg_amt\tstdev_submitted_chrg_amt\taverage_Medicare_payment_amt\tstdev_Medicare_payment_amt"
## [2] "0000000001\tCPT copyright 2011 American Medical Association.  All Rights Reserved.\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "1003000126\tENKESHAFI\tARDALAN\t\tM.D.\tM\tI\t900 SETON DR\t\tCUMBERLAND\t215021854\tMD\tUS\tInternal Medicine\tY\tF\t99222\tInitial hospital care\t115\t112\t115\t135.25\t0\t199\t0\t108.11565217\t0.9005883395"

OK. The first line has the headers, the second is a copyright notice, and the rest are real data in tab-delimited format. Records are separated by newlines.

How do we use an iterator?

library(iterators)
library(foreach)

The function we need to understand is ireadLines.

con_obj <- file("../Data/Medicare-Physician-and-Other-Supplier-PUF-CY2012/Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt", 
    "r")
con_line_iterator <- ireadLines(con_obj, n = 1)
nextElem(con_line_iterator)
## [1] "npi\tnppes_provider_last_org_name\tnppes_provider_first_name\tnppes_provider_mi\tnppes_credentials\tnppes_provider_gender\tnppes_entity_code\tnppes_provider_street1\tnppes_provider_street2\tnppes_provider_city\tnppes_provider_zip\tnppes_provider_state\tnppes_provider_country\tprovider_type\tmedicare_participation_indicator\tplace_of_service\thcpcs_code\thcpcs_description\tline_srvc_cnt\tbene_unique_cnt\tbene_day_srvc_cnt\taverage_Medicare_allowed_amt\tstdev_Medicare_allowed_amt\taverage_submitted_chrg_amt\tstdev_submitted_chrg_amt\taverage_Medicare_payment_amt\tstdev_Medicare_payment_amt"
nextElem(con_line_iterator)
## [1] "0000000001\tCPT copyright 2011 American Medical Association.  All Rights Reserved.\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"

To work with these records we should split the lines on \t. For example, here is how we get the field names.

con_obj <- file("../Data/Medicare-Physician-and-Other-Supplier-PUF-CY2012/Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt", 
    "r")
con_line_iterator <- ireadLines(con_obj, n = 1)
header_line <- nextElem(con_line_iterator)
header_con <- textConnection(header_line)
headers <- scan(file = header_con, what = "character", sep = "\t")
headers
##  [1] "npi"                              "nppes_provider_last_org_name"    
##  [3] "nppes_provider_first_name"        "nppes_provider_mi"               
##  [5] "nppes_credentials"                "nppes_provider_gender"           
##  [7] "nppes_entity_code"                "nppes_provider_street1"          
##  [9] "nppes_provider_street2"           "nppes_provider_city"             
## [11] "nppes_provider_zip"               "nppes_provider_state"            
## [13] "nppes_provider_country"           "provider_type"                   
## [15] "medicare_participation_indicator" "place_of_service"                
## [17] "hcpcs_code"                       "hcpcs_description"               
## [19] "line_srvc_cnt"                    "bene_unique_cnt"                 
## [21] "bene_day_srvc_cnt"                "average_Medicare_allowed_amt"    
## [23] "stdev_Medicare_allowed_amt"       "average_submitted_chrg_amt"      
## [25] "stdev_submitted_chrg_amt"         "average_Medicare_payment_amt"    
## [27] "stdev_Medicare_payment_amt"

Recall that we want records from Indiana providers; field 12 (nppes_provider_state) contains the state information.
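As a quick check against the headers we extracted above:

which(headers == "nppes_provider_state")
## [1] 12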

Here is the function we want to apply to a line.

line_select <- function(x) {
    # Split one raw line into its tab-delimited fields
    t <- textConnection(x)
    fields <- scan(t, what = "character", sep = "\t", quiet = TRUE)
    close(t)
    # Field 12 is nppes_provider_state; keep only Indiana records
    if (fields[12] == "IN") {
        return(fields)
    }
}
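As a quick sanity check, the Maryland record we read earlier should not be kept (the function returns NULL):

md_check <- line_select(samp_med2[3])   # field 12 of this line is "MD"
is.null(md_check)
## [1] TRUE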

We'll use foreach to iterate over the lines, split them into fields, return the Indiana records, and combine into a table (matrix or data.frame).

This chunk prepares a subset of 1000 records to run in class.

con_obj <- file("../Data/Medicare-Physician-and-Other-Supplier-PUF-CY2012/Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt", 
    "r")
con_line_iterator <- ireadLines(con_obj, n = 1000)
sample_med_data <- nextElem(con_line_iterator)
cat(sample_med_data, file = "sample_record.txt", sep = "\n")
con_obj_sample <- file("sample_record.txt", "r")
con_line_iterator_sample <- ireadLines(con_obj_sample, n = 1)
ind_providers_df <- foreach(r = con_line_iterator_sample, .combine = "rbind") %do% 
    line_select(r)

Now inspect the result.

class(ind_providers_df)
## [1] "matrix"
ind_providers_df[1:3, ]
##            [,1]         [,2]              [,3]     [,4] [,5]   [,6] [,7]
## result.539 "1003005208" "YANGA-BARKSDALE" "THELMA" "R"  "P.T." "F"  "I" 
## result.540 "1003005208" "YANGA-BARKSDALE" "THELMA" "R"  "P.T." "F"  "I" 
## result.541 "1003005208" "YANGA-BARKSDALE" "THELMA" "R"  "P.T." "F"  "I" 
##            [,8]                [,9]      [,10]         [,11]       [,12]
## result.539 "2723 S 7TH STREET" "SUITE A" "TERRE HAUTE" "478023558" "IN" 
## result.540 "2723 S 7TH STREET" "SUITE A" "TERRE HAUTE" "478023558" "IN" 
## result.541 "2723 S 7TH STREET" "SUITE A" "TERRE HAUTE" "478023558" "IN" 
##            [,13] [,14]                [,15] [,16] [,17]  
## result.539 "US"  "Physical Therapist" "Y"   "O"   "97001"
## result.540 "US"  "Physical Therapist" "Y"   "O"   "97012"
## result.541 "US"  "Physical Therapist" "Y"   "O"   "97035"
##            [,18]                         [,19] [,20] [,21] [,22]         
## result.539 "Pt evaluation"               "146" "142" "146" "70.439589041"
## result.540 "Mechanical traction therapy" "76"  "13"  "76"  "14.071578947"
## result.541 "Ultrasound therapy"          "281" "52"  "281" "10.88"       
##            [,23]          [,24]         [,25]          [,26]         
## result.539 "0.3467439083" "135.4109589" "16.441124196" "52.878013699"
## result.540 "0.5548377018" "28"          "0"            "9.5806578947"
## result.541 "0"            "22"          "0"            "8.2209608541"
##            [,27]         
## result.539 "12.042286659"
## result.540 "3.4653200412"
## result.541 "1.716333343"
dim(ind_providers_df)
## [1] 13 27

Only 13 of these records are for Indiana, and we kept the memory requirements low. Reading all 9 million lines this way would take a long time, but it is doable for a file of any size.
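Applying the same pattern to the full file is just a matter of pointing the iterator at the original connection; here is a sketch (with 9 million lines this would run for quite a while):

con_full <- file("../Data/Medicare-Physician-and-Other-Supplier-PUF-CY2012/Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt", 
    "r")
full_line_iterator <- ireadLines(con_full, n = 1)
ind_providers_full <- foreach(r = full_line_iterator, .combine = "rbind") %do% 
    line_select(r)
close(con_full)
colnames(ind_providers_full) <- headers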

Getting data over the Web with an API

Here we'll get some articles from the NYTimes using my API key.

Here is the URL we need.

url <- "http://api.nytimes.com/svc/mostpopular/v2/mostviewed/all-sections/7.json?api-key=ec90a5207691967254be2ab415a57aee:8:42005440"
raw_data <- readLines(url)
save(raw_data, file = "raw_data.RData")
load("raw_data.RData")
class(raw_data)
## [1] "character"
length(raw_data)
## [1] 1

This data is in JSON (JavaScript Object Notation) format. Happily, there is a package for that.

library(rjson)
article_data <- fromJSON(json_str = raw_data)
class(article_data)
## [1] "list"
length(article_data)
## [1] 4
class(article_data[[1]])
## [1] "character"
length(article_data[[1]])
## [1] 1
article_data[[1]]
## [1] "OK"
article_data[[2]]
## [1] "Copyright (c) 2014 The New York Times Company.  All Rights Reserved."
article_data[[3]]
## [1] 926
article_data[[4]]

List entry 4 contains a list of all the articles, properly divided into fields.
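For example, here is a sketch that pulls a couple of fields from each article into a data frame (it assumes each element of article_data[[4]] is a named list with fields such as "title" and "section"):

articles_df <- data.frame(
    title   = sapply(article_data[[4]], function(a) a[["title"]]),
    section = sapply(article_data[[4]], function(a) a[["section"]]),
    stringsAsFactors = FALSE)
head(articles_df)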