Introduction to data frames

Since R is all about analyzing data, it isn't surpising that a central object in R is the data frame, an object for storing tabular data.

A typical way to get data into R is to read in a .csv file, which R makes very easy.

setwd("~/Documents/Computing with Data/5_Data_frames/")
# This is where I'm working for this lecture
df1 <- read.csv(file = "Data/metabric_published_clinical.csv")
class(df1)
## [1] "data.frame"
dim(df1)  # dimensions of the table
## [1] 1992    8

The structure function, abbreviated str gives a good summary of the table

str(df1)
## 'data.frame':    1992 obs. of  8 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : Factor w/ 1992 levels "MB-0000","MB-0002",..: 2 5 6 12 13 18 22 23 26 37 ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
##  $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
##  $ grade                     : Factor w/ 4 levels "1","2","3","null": 3 3 3 2 2 2 3 3 2 2 ...
##  $ size                      : Factor w/ 114 levels "0","1","10","100",..: 3 76 64 59 48 68 28 51 24 69 ...
##  $ stage                     : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...

The data weren't all imported in the mode we'd like. There are a number of ways to fix that, but first some basics.

Names and measurements

While a data.frame may look like a matrix, formally it is a special kind of list. This enables it to have columns of numeric and character mode. The components of the list are the columns, which contain the sample characterics with respect to a fixed variable. In effect a data frame is a list in which each component is an atomic vector and all have the same length. It measures the properties of the table like a matrix, though.

names(df1)  # These are the column (variable) names
## [1] "X"                          "sample"                    
## [3] "age_at_diagnosis"           "last_follow_up_status"     
## [5] "menopausal_status_inferred" "grade"                     
## [7] "size"                       "stage"
dim(df1)
## [1] 1992    8
nrow(df1)
## [1] 1992
ncol(df1)
## [1] 8

In keeping with the list ancestry, columns can be extracted with the \$ notation.

df1$sample[1:5]
## [1] MB-0002 MB-0008 MB-0010 MB-0035 MB-0036
## 1992 Levels: MB-0000 MB-0002 MB-0005 MB-0006 MB-0008 MB-0010 ... MB-7299

(Note that it listed the levels because this is a factor.)

It is also possible to construct or extend a data.frame using list-like operations.

v1 <- c(1.2, 1.4, 2)
g <- c("M", "F", "M")
df_small <- data.frame(V = v1, Gender = g)
df_small
##     V Gender
## 1 1.2      M
## 2 1.4      F
## 3 2.0      M
ht <- c(72, 63, 70)
df_med <- data.frame(df_small, Height = ht)
df_med
##     V Gender Height
## 1 1.2      M     72
## 2 1.4      F     63
## 3 2.0      M     70
names(df_med)
## [1] "V"      "Gender" "Height"

Finally, it may help to assign row names to really identify a record index for all of the values. Rownames are character vectors.

rownames(df_med)
## [1] "1" "2" "3"
rownames(df_med) <- c("S1", "S2", "S3")
df_med
##      V Gender Height
## S1 1.2      M     72
## S2 1.4      F     63
## S3 2.0      M     70

In the clinical data, the record index is stored inside the data.frame in the sample variable.

Changing variable mode

There are multiple things wrong with how R set the modes of the variables in df1.

  1. sample should be a character vector indexing the records (patients). It isn't a categorical variable. There will never be two sample values the same.
  2. size is a numeric measurement of the size of the tumor in mm. It should be converted to a numeric vector.

Problem: Change the mode of size and sample in a copy, df2 of df1.

df2 <- df1
df2$size <- as.numeric(df2$size)
df2$sample <- as.character(df2$sample)
str(df2)
## 'data.frame':    1992 obs. of  8 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : chr  "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
##  $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
##  $ grade                     : Factor w/ 4 levels "1","2","3","null": 3 3 3 2 2 2 3 3 2 2 ...
##  $ size                      : num  3 76 64 59 48 68 28 51 24 69 ...
##  $ stage                     : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...

There is a more subtle problem with grade. The possible values are "1", "2", "3" and "null".

table(df1$grade)
## 
##    1    2    3 null 
##  170  775  957   90

Having null as a possible value will mess up statistical analyses using grade. We need to convert it to NA.

Indexing and subsetting

Indexing columns and rows with integers works just as for matrices. The same with names and rownames.

df2[1:5, c("sample", "age_at_diagnosis", "last_follow_up_status")]
##    sample age_at_diagnosis last_follow_up_status
## 1 MB-0002            43.19                     a
## 2 MB-0008            76.97                d-d.s.
## 3 MB-0010            78.77                d-d.s.
## 4 MB-0035            84.22                     d
## 5 MB-0036            85.49                     a

Of course, we can use a logical vector in the row or names slot, although it is rarely used in the names slot.

# logical to select the odd rows up until 9.
log_sel <- rep(FALSE, times = 1992)
log_sel[c(1, 3, 5, 7, 9)] <- TRUE
df_sub1 <- df2[log_sel, ]
df_sub1
##   X  sample age_at_diagnosis last_follow_up_status
## 1 1 MB-0002            43.19                     a
## 3 3 MB-0010            78.77                d-d.s.
## 5 5 MB-0036            85.49                     a
## 7 7 MB-0059            75.58                     a
## 9 9 MB-0066            61.49                     a
##   menopausal_status_inferred grade size stage
## 1                        pre     3    3     1
## 3                       post     3   64     4
## 5                       post     2   48     4
## 7                       post     3   28     1
## 9                       post     2   24     2

Logicals come mostly from equations and comparisons.

summary(df2$size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    33.0    51.0    49.7    63.0   114.0
# Select the data for the patients with tumors < 40 mm
df3 <- df2[df2$size < 40, ]
# assign 0 to the age column for all entries with last_follow_up_status =
# 'a'
df3[df3$last_follow_up_status == "a" & !(is.na(df3$last_follow_up_status)), 
    "age_at_diagnosis"] <- 0
# Notice we had to exclude missing values or we'll get an error

Problem: Replace the "null" values in grade by NAs in df2.

df4 <- df2
df4[df4$grade == "null", "grade"] <- NA
table(df4$grade)
## 
##    1    2    3 null 
##  170  775  957    0

"null" is still in the data because it's a level of the factor. Get rid of it.

df4$grade <- factor(as.character(df4$grade))
table(df4$grade)
## 
##   1   2   3 
## 170 775 957
sum(is.na(df4$grade))
## [1] 90

Controlling the classes when reading it in

Check the options on read.csv in help. There are 2 things that can help avoid post-loading cleanup.

clin_df1 <- read.csv(file = "Data/metabric_published_clinical.csv", stringsAsFactors = FALSE)
str(clin_df1)
## 'data.frame':    1992 obs. of  8 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : chr  "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : chr  "a" "d-d.s." "d-d.s." "d" ...
##  $ menopausal_status_inferred: chr  "pre" "post" "post" "post" ...
##  $ grade                     : chr  "3" "3" "3" "2" ...
##  $ size                      : chr  "10" "40" "31" "28" ...
##  $ stage                     : chr  "1" "2" "4" "2" ...
sum(is.na(clin_df1$size))
## [1] 0
sum(is.null(clin_df1$size))
## [1] 0
sum(clin_df1$size == "null")
## [1] 20

This is why it got read in as a character. What happened when we converted to a numeric earlier.

as.numeric(c("1", "3", "null"))
## Warning: NAs introduced by coercion
## [1]  1  3 NA
# Just as we wanted, in this case

We still got size as a character instead of a number. We can also specify the classes of what's input.

desired_class <- c("integer", "character", "numeric", "factor", "character", 
    "character", "character", "character")
clin_df2 <- read.csv(file = "Data/metabric_published_clinical.csv", colClasses = desired_class)
str(clin_df2)
## 'data.frame':    1992 obs. of  8 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : chr  "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
##  $ menopausal_status_inferred: chr  "pre" "post" "post" "post" ...
##  $ grade                     : chr  "3" "3" "3" "2" ...
##  $ size                      : chr  "10" "40" "31" "28" ...
##  $ stage                     : chr  "1" "2" "4" "2" ...

Reading large spreadsheets

Some spreadsheets you deal with will be big. Here is on-time arrival data for US flights, January, 2012 from the Bureau of Labor Statistics.

arriv_df <- read.csv("Data/flight_ontime_status.csv")
system.time(arriv_df <- read.csv("Data/flight_ontime_status.csv"))
##    user  system elapsed 
##   3.069   0.116   3.185
dim(arriv_df)
## [1] 486133     21
names(arriv_df)
##  [1] "ORIGIN_AIRPORT_ID"     "ORIGIN_AIRPORT_SEQ_ID"
##  [3] "ORIGIN_CITY_MARKET_ID" "ORIGIN_STATE_ABR"     
##  [5] "DEST_AIRPORT_ID"       "DEST_AIRPORT_SEQ_ID"  
##  [7] "DEST_CITY_MARKET_ID"   "DEST_STATE_ABR"       
##  [9] "DEP_TIME"              "DEP_DELAY"            
## [11] "DEP_DELAY_NEW"         "ARR_TIME"             
## [13] "ARR_DELAY"             "ARR_DELAY_NEW"        
## [15] "DISTANCE"              "CARRIER_DELAY"        
## [17] "WEATHER_DELAY"         "NAS_DELAY"            
## [19] "SECURITY_DELAY"        "LATE_AIRCRAFT_DELAY"  
## [21] "X"
print(object.size(arriv_df), units = "MB")
## 57.5 Mb

A lot of load time is taken up by scanning columns to see what the mode should be. If we specify that in the read call we can speed things up quite a bit. Since we don't know the variable modes ahead of time, lets set them

header_classes <- rep("character", times = 21)

Read it again with the specified column classes and time it.

system.time(arriv_df2 <- read.csv("Data/flight_ontime_status.csv", colClasses = header_classes))
##    user  system elapsed 
##   2.445   0.110   2.555

A better way to restrict a data frame

The [,] way of restricting a data.frame is the "official" way but it gets hard to read. The function subset gives an alternative that is less heavy in special symbols.

clinical_grade1 <- subset(df4, grade == "1")
clinical_grade2 <- subset(df4, grade == "1" & size > 20)

The expression you put in the second component of subset needs to return a logical. There is a value of that variable for each row and you compute the truth value of the formula for that row, selecting the row for inclusion in the sub-data.frame if it's TRUE.

This just restricts the entire dataset. We can also restrict to specific columns.

clin3 <- subset(df4, select = c(sample, grade, size))
head(clin3)
##    sample grade size
## 1 MB-0002     3    3
## 2 MB-0008     3   76
## 3 MB-0010     3   64
## 4 MB-0035     2   59
## 5 MB-0036     2   48
## 6 MB-0050     2   68
clin3 <- subset(df4, grade == "1" & size > 20, select = c(sample, grade, size))
str(clin3)
## 'data.frame':    120 obs. of  3 variables:
##  $ sample: chr  "MB-0106" "MB-0122" "MB-0133" "MB-0155" ...
##  $ grade : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size  : num  103 42 108 111 83 57 111 84 28 48 ...

Adding new variables

Frequently, when working with data we need to add new variables. There are a couple of ways to do this.

Add a binary variable selecting the tumors larger or smaller than 20 mm

We need to operate on the size variable and insert the new variable as a new column.

str(df4)
## 'data.frame':    1992 obs. of  8 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : chr  "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
##  $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
##  $ grade                     : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 2 3 3 2 2 ...
##  $ size                      : num  3 76 64 59 48 68 28 51 24 69 ...
##  $ stage                     : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...
size_var <- df4$size
bin_size <- cut(size_var, breaks = c(-Inf, 20, Inf), labels = c("small", "large"))
table(bin_size)
## bin_size
## small large 
##   351  1641

Now use data.frame to add this.

df5 <- data.frame(df4, BIN_SIZE = bin_size)

It is critical here that you don't change the order of any of the variables you removed from the data.frame. Otherwise, the new variable won't match the right sample.

The function transform lets you do all of this at once. It lets you compute on variables in the dataframe and add a new variable at once.

df6 <- transform(df4, BIN_SIZE = cut(size, breaks = c(-Inf, 20, Inf), labels = c("small", 
    "large")))
table(df6$BIN_SIZE)
## 
## small large 
##   351  1641

One of the important uses of transform is to make sure the variable you're adding is ordered correctly. If a new variable, say, treatment has samples as names, you can add as

df7 <- transform(df5, TRTMNT = treatment[sample])

and be sure the values added match the right samples.

Other important operations

You should look up the help and thoroughly understand the use of