Since R is all about analyzing data, it isn't surpising that a central object in R is the data frame, an object for storing tabular data.
A typical way to get data into R is to read in a .csv file, which R makes very easy.
setwd("~/Documents/Computing with Data/5_Data_frames/")
# This is where I'm working for this lecture
df1 <- read.csv(file = "Data/metabric_published_clinical.csv")
class(df1)
## [1] "data.frame"
dim(df1) # dimensions of the table
## [1] 1992 8
The structure
function, abbreviated str
gives a good summary of the table
str(df1)
## 'data.frame': 1992 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sample : Factor w/ 1992 levels "MB-0000","MB-0002",..: 2 5 6 12 13 18 22 23 26 37 ...
## $ age_at_diagnosis : num 43.2 77 78.8 84.2 85.5 ...
## $ last_follow_up_status : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
## $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
## $ grade : Factor w/ 4 levels "1","2","3","null": 3 3 3 2 2 2 3 3 2 2 ...
## $ size : Factor w/ 114 levels "0","1","10","100",..: 3 76 64 59 48 68 28 51 24 69 ...
## $ stage : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...
The data weren't all imported in the mode we'd like. There are a number of ways to fix that, but first some basics.
While a data.frame may look like a matrix, formally it is a special kind of list. This enables it to have columns of numeric and character mode. The components of the list are the columns, which contain the sample characterics with respect to a fixed variable. In effect a data frame is a list in which each component is an atomic vector and all have the same length. It measures the properties of the table like a matrix, though.
names(df1) # These are the column (variable) names
## [1] "X" "sample"
## [3] "age_at_diagnosis" "last_follow_up_status"
## [5] "menopausal_status_inferred" "grade"
## [7] "size" "stage"
dim(df1)
## [1] 1992 8
nrow(df1)
## [1] 1992
ncol(df1)
## [1] 8
In keeping with the list ancestry, columns can be extracted with the \$ notation.
df1$sample[1:5]
## [1] MB-0002 MB-0008 MB-0010 MB-0035 MB-0036
## 1992 Levels: MB-0000 MB-0002 MB-0005 MB-0006 MB-0008 MB-0010 ... MB-7299
(Note that it listed the levels because this is a factor.)
It is also possible to construct or extend a data.frame using list-like operations.
v1 <- c(1.2, 1.4, 2)
g <- c("M", "F", "M")
df_small <- data.frame(V = v1, Gender = g)
df_small
## V Gender
## 1 1.2 M
## 2 1.4 F
## 3 2.0 M
ht <- c(72, 63, 70)
df_med <- data.frame(df_small, Height = ht)
df_med
## V Gender Height
## 1 1.2 M 72
## 2 1.4 F 63
## 3 2.0 M 70
names(df_med)
## [1] "V" "Gender" "Height"
Finally, it may help to assign row names to really identify a record index for all of the values. Rownames are character vectors.
rownames(df_med)
## [1] "1" "2" "3"
rownames(df_med) <- c("S1", "S2", "S3")
df_med
## V Gender Height
## S1 1.2 M 72
## S2 1.4 F 63
## S3 2.0 M 70
In the clinical data, the record index is stored inside the data.frame in the sample variable.
There are multiple things wrong with how R set the modes of the variables in df1.
Problem: Change the mode of size and sample in a copy, df2 of df1.
df2 <- df1
df2$size <- as.numeric(df2$size)
df2$sample <- as.character(df2$sample)
str(df2)
## 'data.frame': 1992 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sample : chr "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
## $ age_at_diagnosis : num 43.2 77 78.8 84.2 85.5 ...
## $ last_follow_up_status : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
## $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
## $ grade : Factor w/ 4 levels "1","2","3","null": 3 3 3 2 2 2 3 3 2 2 ...
## $ size : num 3 76 64 59 48 68 28 51 24 69 ...
## $ stage : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...
There is a more subtle problem with grade. The possible values are "1", "2", "3" and "null".
table(df1$grade)
##
## 1 2 3 null
## 170 775 957 90
Having null as a possible value will mess up statistical analyses using grade. We need to convert it to NA.
Indexing columns and rows with integers works just as for matrices. The same with names and rownames.
df2[1:5, c("sample", "age_at_diagnosis", "last_follow_up_status")]
## sample age_at_diagnosis last_follow_up_status
## 1 MB-0002 43.19 a
## 2 MB-0008 76.97 d-d.s.
## 3 MB-0010 78.77 d-d.s.
## 4 MB-0035 84.22 d
## 5 MB-0036 85.49 a
Of course, we can use a logical vector in the row or names slot, although it is rarely used in the names slot.
# logical to select the odd rows up until 9.
log_sel <- rep(FALSE, times = 1992)
log_sel[c(1, 3, 5, 7, 9)] <- TRUE
df_sub1 <- df2[log_sel, ]
df_sub1
## X sample age_at_diagnosis last_follow_up_status
## 1 1 MB-0002 43.19 a
## 3 3 MB-0010 78.77 d-d.s.
## 5 5 MB-0036 85.49 a
## 7 7 MB-0059 75.58 a
## 9 9 MB-0066 61.49 a
## menopausal_status_inferred grade size stage
## 1 pre 3 3 1
## 3 post 3 64 4
## 5 post 2 48 4
## 7 post 3 28 1
## 9 post 2 24 2
Logicals come mostly from equations and comparisons.
summary(df2$size)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 33.0 51.0 49.7 63.0 114.0
# Select the data for the patients with tumors < 40 mm
df3 <- df2[df2$size < 40, ]
# assign 0 to the age column for all entries with last_follow_up_status =
# 'a'
df3[df3$last_follow_up_status == "a" & !(is.na(df3$last_follow_up_status)),
"age_at_diagnosis"] <- 0
# Notice we had to exclude missing values or we'll get an error
Problem: Replace the "null" values in grade by NAs in df2.
df4 <- df2
df4[df4$grade == "null", "grade"] <- NA
table(df4$grade)
##
## 1 2 3 null
## 170 775 957 0
"null" is still in the data because it's a level of the factor. Get rid of it.
df4$grade <- factor(as.character(df4$grade))
table(df4$grade)
##
## 1 2 3
## 170 775 957
sum(is.na(df4$grade))
## [1] 90
Check the options on read.csv in help. There are 2 things that can help avoid post-loading cleanup.
clin_df1 <- read.csv(file = "Data/metabric_published_clinical.csv", stringsAsFactors = FALSE)
str(clin_df1)
## 'data.frame': 1992 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sample : chr "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
## $ age_at_diagnosis : num 43.2 77 78.8 84.2 85.5 ...
## $ last_follow_up_status : chr "a" "d-d.s." "d-d.s." "d" ...
## $ menopausal_status_inferred: chr "pre" "post" "post" "post" ...
## $ grade : chr "3" "3" "3" "2" ...
## $ size : chr "10" "40" "31" "28" ...
## $ stage : chr "1" "2" "4" "2" ...
sum(is.na(clin_df1$size))
## [1] 0
sum(is.null(clin_df1$size))
## [1] 0
sum(clin_df1$size == "null")
## [1] 20
This is why it got read in as a character. What happened when we converted to a numeric earlier.
as.numeric(c("1", "3", "null"))
## Warning: NAs introduced by coercion
## [1] 1 3 NA
# Just as we wanted, in this case
We still got size as a character instead of a number. We can also specify the classes of what's input.
desired_class <- c("integer", "character", "numeric", "factor", "character",
"character", "character", "character")
clin_df2 <- read.csv(file = "Data/metabric_published_clinical.csv", colClasses = desired_class)
str(clin_df2)
## 'data.frame': 1992 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sample : chr "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
## $ age_at_diagnosis : num 43.2 77 78.8 84.2 85.5 ...
## $ last_follow_up_status : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
## $ menopausal_status_inferred: chr "pre" "post" "post" "post" ...
## $ grade : chr "3" "3" "3" "2" ...
## $ size : chr "10" "40" "31" "28" ...
## $ stage : chr "1" "2" "4" "2" ...
Some spreadsheets you deal with will be big. Here is on-time arrival data for US flights, January, 2012 from the Bureau of Labor Statistics.
arriv_df <- read.csv("Data/flight_ontime_status.csv")
system.time(arriv_df <- read.csv("Data/flight_ontime_status.csv"))
## user system elapsed
## 3.069 0.116 3.185
dim(arriv_df)
## [1] 486133 21
names(arriv_df)
## [1] "ORIGIN_AIRPORT_ID" "ORIGIN_AIRPORT_SEQ_ID"
## [3] "ORIGIN_CITY_MARKET_ID" "ORIGIN_STATE_ABR"
## [5] "DEST_AIRPORT_ID" "DEST_AIRPORT_SEQ_ID"
## [7] "DEST_CITY_MARKET_ID" "DEST_STATE_ABR"
## [9] "DEP_TIME" "DEP_DELAY"
## [11] "DEP_DELAY_NEW" "ARR_TIME"
## [13] "ARR_DELAY" "ARR_DELAY_NEW"
## [15] "DISTANCE" "CARRIER_DELAY"
## [17] "WEATHER_DELAY" "NAS_DELAY"
## [19] "SECURITY_DELAY" "LATE_AIRCRAFT_DELAY"
## [21] "X"
print(object.size(arriv_df), units = "MB")
## 57.5 Mb
A lot of load time is taken up by scanning columns to see what the mode should be. If we specify that in the read call we can speed things up quite a bit. Since we don't know the variable modes ahead of time, lets set them
header_classes <- rep("character", times = 21)
Read it again with the specified column classes and time it.
system.time(arriv_df2 <- read.csv("Data/flight_ontime_status.csv", colClasses = header_classes))
## user system elapsed
## 2.445 0.110 2.555
The [,] way of restricting a data.frame is the "official" way but it gets hard to read. The function subset
gives an alternative that is less heavy in special symbols.
clinical_grade1 <- subset(df4, grade == "1")
clinical_grade2 <- subset(df4, grade == "1" & size > 20)
The expression you put in the second component of subset
needs to return a logical. There is a value of that variable for each row and you compute the truth value of the formula for that row, selecting the row for inclusion in the sub-data.frame if it's TRUE.
This just restricts the entire dataset. We can also restrict to specific columns.
clin3 <- subset(df4, select = c(sample, grade, size))
head(clin3)
## sample grade size
## 1 MB-0002 3 3
## 2 MB-0008 3 76
## 3 MB-0010 3 64
## 4 MB-0035 2 59
## 5 MB-0036 2 48
## 6 MB-0050 2 68
clin3 <- subset(df4, grade == "1" & size > 20, select = c(sample, grade, size))
str(clin3)
## 'data.frame': 120 obs. of 3 variables:
## $ sample: chr "MB-0106" "MB-0122" "MB-0133" "MB-0155" ...
## $ grade : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : num 103 42 108 111 83 57 111 84 28 48 ...
Frequently, when working with data we need to add new variables. There are a couple of ways to do this.
We need to operate on the size variable and insert the new variable as a new column.
str(df4)
## 'data.frame': 1992 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sample : chr "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
## $ age_at_diagnosis : num 43.2 77 78.8 84.2 85.5 ...
## $ last_follow_up_status : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
## $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
## $ grade : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 2 3 3 2 2 ...
## $ size : num 3 76 64 59 48 68 28 51 24 69 ...
## $ stage : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...
size_var <- df4$size
bin_size <- cut(size_var, breaks = c(-Inf, 20, Inf), labels = c("small", "large"))
table(bin_size)
## bin_size
## small large
## 351 1641
Now use data.frame to add this.
df5 <- data.frame(df4, BIN_SIZE = bin_size)
It is critical here that you don't change the order of any of the variables you removed from the data.frame. Otherwise, the new variable won't match the right sample.
The function transform
lets you do all of this at once. It lets you compute on variables in the dataframe and add a new variable at once.
df6 <- transform(df4, BIN_SIZE = cut(size, breaks = c(-Inf, 20, Inf), labels = c("small",
"large")))
table(df6$BIN_SIZE)
##
## small large
## 351 1641
One of the important uses of transform
is to make sure the variable you're adding is ordered correctly. If a new variable, say, treatment
has samples as names, you can add as
df7 <- transform(df5, TRTMNT = treatment[sample])
and be sure the values added match the right samples.
You should look up the help and thoroughly understand the use of
rbind
for adding new rows to a dataframeunique
for removing duplicate rowsmerge
for joining to dataframes by matching rows using a selected column (like sample
).