Goals for the Computing with Data Seminar

There are many goals for this seminar. For one thing, you need to become better R programmers: R is your most basic tool for statistical analysis, after your brain. But I hope to go beyond the level you need to, say, do one homework problem in a class, and to incorporate techniques and paradigms that have developed, and continue to develop, in commercial data science and large-scale statistical modeling.

Commercial data science

A statistician working at a company like Amazon is trying to direct the company's marketing in the most effective way. There are hundreds if not thousands of attributes (independent variables) to choose from. Target market sectors and specific goals may change weekly. Amazon needs automated methods for selecting significant attributes, building models, assessing their significance and potentially revising the fit, all on the fly. In a real sense, the data scientist is building a statistical application instead of a statistical model.

Large-scale research projects

In medical and biological research today, the raw material may be gleaned from whole-genome sequencing and transcription data, together with the growing knowledge of the relationships between these "nodes". Statisticians are playing a much greater role in generating the biological knowledge that comes from experiments. The work of statisticians has moved from being compartmentalized to being a driving force in the collaborations. As such, the statistical work needs to be as independently reproducible as the experiments.

Reproducible research for statisticians

Reproducible research is an active force in statistical software development and methodology. It supports and is supported by both of the themes mentioned above (statistical applications and reproducible analyses). The well-reported Duke scandal underscores the importance of doing your analyses from scratch in a way that can be reproduced -- by you or anyone else. My immediate goal is for you to begin the seminar with the necessary tools.

Scalable statistical methods

An additional theme in these lectures is to produce analyses that are agnostic to the size of the data. Of course, an analysis with 1000 records and 10 covariates will run much faster than one with 100 million records and 200 covariates. However, we should try to set up analyses of small datasets so that with the right interface to the stored data, and multi-node processing power, they can be implemented with massive datasets. That is, statistical analyses should be scalable.
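One way to picture a size-agnostic analysis is a least-squares fit that only ever sees one chunk of the data at a time. The sketch below is an illustration under stated assumptions, not a method prescribed by these notes: the simulated data and the helper read_chunk are hypothetical stand-ins for a database query or file reader, and accumulating the cross-products X'X and X'y is just one simple technique among several.

```r
# Sketch: fit a linear model by accumulating X'X and X'y over chunks,
# so that only one chunk has to be in memory at a time.
set.seed(1)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

# Hypothetical chunk reader: here it only slices an in-memory data frame,
# but it could be swapped for a database query or a file reader.
read_chunk <- function(i, chunk_size = 100) {
  rows <- ((i - 1) * chunk_size + 1):(i * chunk_size)
  dat[rows, ]
}

xtx <- matrix(0, 3, 3)   # running sum of X'X (intercept, x1, x2)
xty <- matrix(0, 3, 1)   # running sum of X'y
for (i in 1:10) {
  chunk <- read_chunk(i)
  X <- model.matrix(~ x1 + x2, data = chunk)
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}

# Solve the normal equations and compare with a fit on the full data
beta_chunked <- drop(solve(xtx, xty))
beta_full <- coef(lm(y ~ x1 + x2, data = dat))
print(rbind(chunked = beta_chunked, full = beta_full))
```

The analysis code inside the loop does not change when the data grow; only read_chunk and the number of chunks would.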

Tools for Reproducible Research

RStudio

First, do your R work in RStudio, an integrated development environment (IDE). To begin, read the section Using RStudio in the RStudio documentation. Also set your preferences so that RStudio never saves or restores a .RData file.

Baseline knowledge of R

As a starting point, please become familiar with the terms "Working directory", "Workspace" and "History". Also learn how to use the commands setwd and getwd, and how to navigate the file system to supply the path arguments that such commands take. Always begin a project by creating a directory for it and setting that directory as the working directory.
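A minimal sketch of that workflow follows. The directory name my_project is hypothetical, and it is created under the session's temporary directory only so the example leaves nothing behind; for real work you would choose a permanent location.

```r
# Create a project directory (hypothetical name) under the session temp dir
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

old_wd <- getwd()      # remember the current working directory
setwd(project_dir)     # make the project directory the working directory
print(getwd())         # confirm the change
print(list.files())    # relative paths now resolve inside project_dir

setwd(old_wd)          # restore the original working directory
```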

Also learn the commands save and load for writing R objects to a file and loading them back into an R session.
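As a small illustration, the following sketch saves one object and restores it. The object name fit_data and the file name are made up for the example, and the file is written to the temporary directory.

```r
# A small data frame to store (illustrative values only)
fit_data <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Write the object to an .RData file
rdata_file <- file.path(tempdir(), "fit_data.RData")
save(fit_data, file = rdata_file)

# Remove it from the workspace, then restore it from the file
rm(fit_data)
loaded_names <- load(rdata_file)  # load() returns the names of restored objects
print(loaded_names)
print(fit_data)
```

Note that load restores objects under the names they were saved with, so it can silently overwrite an object of the same name in the current workspace.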

Learn how to create R source documents in RStudio and source them in new sessions.
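For instance, a source document is just a text file of R commands. In the sketch below the file analysis.R is hypothetical and is written from R only to keep the example self-contained; normally you would create it in the RStudio editor.

```r
# Write a tiny source document (normally created in the RStudio editor)
script_file <- file.path(tempdir(), "analysis.R")
writeLines(c(
  "# analysis.R: a tiny source document",
  "x <- c(4, 8, 15, 16, 23, 42)",
  "x_mean <- mean(x)"
), script_file)

# Run every command in the file; objects it creates land in the workspace
source(script_file)
print(x_mean)
```

Calling source(script_file, echo = TRUE) additionally prints each command and its result to the console as it runs.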

R Markdown

A source document has the disadvantage that you can't see the output of the commands alongside the code; the output appears only in the console when you source the document. R Markdown presents a method of producing HTML documents that include both code chunks and the output of the commands. You can structure the document into sections and add text explanations of the analysis. Figures can be added as in any HTML page. Math equations can be added simply by writing the LaTeX commands. See the section on R Markdown in the RStudio documentation.
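To make the structure concrete, the sketch below writes a minimal R Markdown file containing a heading, a LaTeX equation and two code chunks. The file name example.Rmd, the chunk labels and the contents are all hypothetical; in practice you would type the document directly in RStudio rather than generate it from R.

```r
# Build the three-backtick chunk delimiter programmatically so that this
# listing can itself be embedded in a Markdown document.
fence <- strrep("`", 3)

rmd_lines <- c(
  "Example analysis",
  "================",
  "",
  "The sample mean is $\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$.",
  "",
  paste0(fence, "{r summary}"),           # a chunk whose printed output is shown
  "x <- c(4, 8, 15, 16, 23, 42)",
  "mean(x)",
  fence,
  "",
  paste0(fence, "{r scatter, fig.width=5}"),  # a chunk that produces a figure
  "plot(x)",
  fence
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")  # display the document source
```

Each chunk starts with three backticks followed by {r label, options} and ends with three backticks; everything outside the chunks is ordinary Markdown text.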

knitr

knitr is an R package that renders an R Markdown document into HTML, including the output of any R code chunks. knitr has its origins in Sweave, which takes documents written in the Sweave format (similar to R Markdown) and produces a PDF using LaTeX. knitr addresses many of the limitations of generic Sweave.

To use knitr you first need to install the R package. At a command prompt enter

install.packages("knitr", dependencies = TRUE)

Normally, to use a package you must first load it, for example with library(knitr). In RStudio, however, this is done in the background when you execute the command Knit HTML.
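Outside the Knit HTML button, knitr can also be called directly. The sketch below is one possible way to do this, assuming knitr is already installed: it passes a tiny, made-up document to knitr::knit through its text argument, which runs the chunk and returns Markdown rather than HTML. Converting that Markdown to HTML is a further step, which RStudio handles for you.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Build the chunk delimiter programmatically so that this listing can
  # itself be embedded in a Markdown document.
  fence <- strrep("`", 3)

  # A tiny, hypothetical R Markdown document held in memory
  rmd_text <- c("A tiny report", "", paste0(fence, "{r}"), "1 + 1", fence)

  # Run the chunk and weave its code and output into Markdown text
  md_out <- knitr::knit(text = rmd_text, quiet = TRUE)
  cat(md_out, sep = "\n")
} else {
  message("knitr is not installed; run install.packages(\"knitr\") first")
}
```

Writing knitr::knit calls the function without attaching the package, which is a lighter alternative to library(knitr) for a one-off call.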

Reference

A good general reference for R is

Paul Teetor, R Cookbook, O'Reilly Media, March 2011.