“Data munging” is an informal term describing the process of cleaning up raw data in preparation for exploratory data analysis, statistical analysis, visualization, etc. This has become an important part of data science and the applied statistics of complex data. Here, we give an example of the types of transformations needed to clean up data, getting practice with R functions for working with data.frames along the way.

Following is the clinical data on about 2000 breast cancer patients [Curtis C, Shah SP, Chin SF et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012; 486: 346-352]. This is known as the Metabric cohort of patients.

clin_data_df1 <- read.csv(file = "../Data/metabric_published_clinical_df.csv")
str(clin_data_df1)
## 'data.frame':    1992 obs. of  29 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : Factor w/ 1992 levels "MB-0000","MB-0002",..: 2 5 6 12 13 18 22 23 26 37 ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : Factor w/ 4 levels "a","d","d-d.s.",..: 1 3 3 2 1 1 1 1 1 1 ...
##  $ menopausal_status_inferred: Factor w/ 3 levels "null","post",..: 3 2 2 2 2 3 2 3 2 3 ...
##  $ group                     : Factor w/ 5 levels "1","2","3","4",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ grade                     : Factor w/ 4 levels "1","2","3","null": 3 3 3 2 2 2 3 3 2 2 ...
##  $ size                      : Factor w/ 114 levels "0","1","10","100",..: 3 76 64 59 48 68 28 51 24 69 ...
##  $ stage                     : Factor w/ 6 levels "0","1","2","3",..: 2 3 5 3 5 3 2 3 3 3 ...
##  $ lymph_nodes_positive      : Factor w/ 32 levels "0","1","10","11",..: 1 30 1 1 1 21 1 1 2 1 ...
##  $ lymph_nodes_removed       : Factor w/ 49 levels "0","1","10","11",..: 19 47 7 9 17 10 20 6 13 7 ...
##  $ NPI                       : num  4.02 6.08 4.06 3.06 3.04 4.07 4.03 4.05 4.03 3.07 ...
##  $ histological_type         : Factor w/ 14 levels "BENIGN","DCIS",..: 3 3 3 8 3 8 3 3 3 8 ...
##  $ ER_IHC_status             : Factor w/ 3 levels "neg","null","pos": 3 3 3 1 3 3 3 3 3 3 ...
##  $ HER2_IHC_status           : Factor w/ 5 levels "0","1","2","3",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ HER2_SNP6_state           : Factor w/ 4 levels "GAIN","LOSS",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ cellularity               : Factor w/ 4 levels "high","low","moderate",..: 1 1 3 1 3 3 1 1 1 3 ...
##  $ P53_mutation_status       : Factor w/ 2 levels "MUT","WT": 1 2 2 2 2 1 2 1 2 2 ...
##  $ P53_mutation_type         : Factor w/ 4 levels "FRAMESHIFT","MISSENSE",..: 2 NA NA NA NA 2 NA 2 NA NA ...
##  $ P53_mutation_details      : Factor w/ 99 levels "MB-AD-0002+ex5,6+chr17:7519122+12521A>AC+178H>H/P+MISSENSE+FR+FR",..: 1 NA NA NA NA 2 NA 3 NA NA ...
##  $ Pam50Subtype              : Factor w/ 6 levels "Basal","Her2",..: 3 4 4 2 3 6 3 4 4 6 ...
##  $ IntClustMemb              : int  4 9 7 3 3 8 8 10 7 8 ...
##  $ Treatment                 : Factor w/ 8 levels "CT","CT/HT","CT/HT/RT",..: 6 3 6 7 6 3 6 3 6 6 ...
##  $ Site                      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Genefu                    : Factor w/ 4 levels "ER-/HER2-","ER+/HER2- High Prolif",..: 2 2 2 2 3 3 2 2 2 3 ...
##  $ ER.Expr                   : Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Her2.Expr                 : Factor w/ 2 levels "-","+": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PR.Expr                   : Factor w/ 2 levels "-","+": 2 2 2 1 2 2 2 2 2 2 ...
##  $ T                         : int  1484 1241 234 1088 2314 1784 1718 1846 1543 2713 ...

Factor of character data

Almost all variables were imported as factors. R will import numeric (and integer) data as such when there are no non-numeric entries. Any column containing a character will be imported as a factor. This is often the wrong interpretation. Moreover, if a column is supposed to be numericand ended up as a factor, cleaning it up is harder than if it were a character. For these reasons, I prefer not to import character data as factors, which is done as follows.

clin_data_df2 <- read.csv(file = "../Data/metabric_published_clinical_df.csv", 
                          stringsAsFactors = FALSE)
str(clin_data_df2)
## 'data.frame':    1992 obs. of  29 variables:
##  $ X                         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample                    : chr  "MB-0002" "MB-0008" "MB-0010" "MB-0035" ...
##  $ age_at_diagnosis          : num  43.2 77 78.8 84.2 85.5 ...
##  $ last_follow_up_status     : chr  "a" "d-d.s." "d-d.s." "d" ...
##  $ menopausal_status_inferred: chr  "pre" "post" "post" "post" ...
##  $ group                     : chr  "4" "4" "4" "4" ...
##  $ grade                     : chr  "3" "3" "3" "2" ...
##  $ size                      : chr  "10" "40" "31" "28" ...
##  $ stage                     : chr  "1" "2" "4" "2" ...
##  $ lymph_nodes_positive      : chr  "0" "8" "0" "0" ...
##  $ lymph_nodes_removed       : chr  "25" "8" "14" "16" ...
##  $ NPI                       : num  4.02 6.08 4.06 3.06 3.04 4.07 4.03 4.05 4.03 3.07 ...
##  $ histological_type         : chr  "IDC" "IDC" "IDC" "ILC" ...
##  $ ER_IHC_status             : chr  "pos" "pos" "pos" "neg" ...
##  $ HER2_IHC_status           : chr  "null" "null" "null" "null" ...
##  $ HER2_SNP6_state           : chr  "NEUT" "NEUT" "NEUT" "LOSS" ...
##  $ cellularity               : chr  "high" "high" "moderate" "high" ...
##  $ P53_mutation_status       : chr  "MUT" "WT" "WT" "WT" ...
##  $ P53_mutation_type         : chr  "MISSENSE" NA NA NA ...
##  $ P53_mutation_details      : chr  "MB-AD-0002+ex5,6+chr17:7519122+12521A>AC+178H>H/P+MISSENSE+FR+FR" NA NA NA ...
##  $ Pam50Subtype              : chr  "LumA" "LumB" "LumB" "Her2" ...
##  $ IntClustMemb              : int  4 9 7 3 3 8 8 10 7 8 ...
##  $ Treatment                 : chr  "HT/RT" "CT/HT/RT" "HT/RT" "NONE" ...
##  $ Site                      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Genefu                    : chr  "ER+/HER2- High Prolif" "ER+/HER2- High Prolif" "ER+/HER2- High Prolif" "ER+/HER2- High Prolif" ...
##  $ ER.Expr                   : chr  "+" "+" "+" "+" ...
##  $ Her2.Expr                 : chr  "-" "-" "-" "-" ...
##  $ PR.Expr                   : chr  "+" "+" "+" "-" ...
##  $ T                         : int  1484 1241 234 1088 2314 1784 1718 1846 1543 2713 ...

Fix missing data codes

The fact that tumor size (in millimeters) was imported as a character indicates there is are some character values in the column. This is a mistake we need to fix.

Inspect the data as cleanly as possible.

unique(clin_data_df2$size)
##   [1] "10"    "40"    "31"    "28"    "22"    "33"    "17"    "23"   
##   [9] "16"    "34"    "70"    "150"   "25"    "26"    "30"    "20"   
##  [17] "35"    "39"    "80"    "13"    "18"    "27"    "19"    "50"   
##  [25] "12"    "9"     "14"    "52"    "24"    "38"    "44"    "36"   
##  [33] "21"    "48"    "3"     "46"    "15"    "11"    "55"    "29"   
##  [41] "47"    "32"    "45"    "67"    "43"    "42"    "180"   "60"   
##  [49] "57"    "100"   "65"    "37"    "90"    "8"     "160"   "84"   
##  [57] "5.5"   "49"    "99"    "68"    "41"    "5"     "2"     "75"   
##  [65] "51"    "null"  "53"    "0"     "4"     "130"   "62"    "1"    
##  [73] "7"     "6"     "120"   "61"    "79"    "71"    "22.5"  "17.9" 
##  [81] "14.5"  "12.8"  "18.5"  "15.5"  "21.5"  "16.9"  "24.4"  "12.5" 
##  [89] "40.3"  "11.8"  "32.6"  "17.2"  "13.8"  "15.7"  "182"   "85"   
##  [97] "18.3"  "21.6"  "28.5"  "16.2"  "2.3"   "15.2"  "31.1"  "14.3" 
## [105] "12.6"  "25.1"  "17.6"  "2.12"  "21.3"  "22.32" "17.7"  "15.47"
## [113] "24.15" "20.5"

Missing values were coded as “null” instead of NA. Let’s fix this.

# Alter the size data in place
clin_data_df2[clin_data_df2$size == "null", "size"] <- NA
# how many missing values are there?
sum(is.na(clin_data_df2$size))
## [1] 20
# check if the null codes are gone
sum(clin_data_df2$size == "null")
## [1] NA
# Need to exclude NA values in the sum
sum(clin_data_df2$size == "null", na.rm = TRUE)
## [1] 0

Now that the character data is gone, let’s change size to a numeric variable.

clin_data_df2$size <- as.numeric(clin_data_df2$size)
# Check the distribution
summary(clin_data_df2$size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    17.0    23.0    26.2    30.0   182.0      20

Create factors when appropriate

When character data really does represent a categorical variable, it makes sense to make it into a factor. One such variable here is last follow up status.

# Inspect the values
table(clin_data_df2$last_follow_up_status)
## 
##      a      d d-d.s. d-o.c. 
##   1081    119    506    266
# Is NA coded properly if there are any?
sum(is.na(clin_data_df2$last_follow_up_status))
## [1] 20

Now just convert to a factor

clin_data_df2$last_follow_up_status <- factor(clin_data_df2$last_follow_up_status)
summary(clin_data_df2$last_follow_up_status)
##      a      d d-d.s. d-o.c.   NA's 
##   1081    119    506    266     20
levels(clin_data_df2$last_follow_up_status)
## [1] "a"      "d"      "d-d.s." "d-o.c."

Summary

Many similar transformations will need to be done to put the data into a form suitable for analysis. There are R programming tools that can help doing similar things for a lot of variables. In any case, spending the time to prepare the data prior to analysis pays off in the end by helping to ensure meaningful results.