Jon Claerbout said:
An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.
What is reproducible research and why is it important
Defensive Programming
Literate Programming
Next Level Stuff
Helpful Programs
It is more a way of life than a single thing.
At its most basic it can be simplified as:
Victoria Stodden has suggested that a more sophisticated notion of reproducible research involves:
It all depends.
Are you:
If you answered “yes” to any of those questions, then it does matter for you.
Writing code is an important part of reproducible research.
Nothing point-and-click is reproducible – ever!
Code \(\neq\) reproducible
Just because you have written the code does not mean it is reproducible.
The same can be said for code you have gotten from someone else.
# Generating column means.
# Note the hard-coded dimensions: 11 columns and 32 rows.
res = 1:11
for(i in 1:11) {
res[i] = apply(mtcars[i], 2, function(x) {
sum(x) / 32
})
}
# Adding a new variable.
library(dplyr)
mtcars %>%
mutate(wtRaw = wt * 1000)
# Generating column means without hard-coded dimensions.
res = 1:ncol(mtcars)
for(i in 1:ncol(mtcars)) {
res[i] = apply(mtcars[i], 2, function(x) {
sum(x) / nrow(mtcars[i])
})
}
This leads us to…
Usually thought about in the context of production software.
Security
Unforeseen errors
We can apply principles of defensive programming to our own research.
Fortunately, we do not need to account for the same things that software engineers need to.
Not having to worry at the level of production software does not excuse us from planning for the breaks that will eventually happen.
We should also be thinking about what we might want to do in the future.
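As a minimal sketch of what defensive programming can look like in research code, the function below checks its input before computing anything. The name `safeColMeans` is hypothetical and not part of the original material; it is one way to fail early and avoid hard-coded dimensions, not the only way.

```r
# Hypothetical helper (not from the slides): column means with input checks.
safeColMeans = function(dat) {
  # Fail early if the input is not a non-empty data frame.
  stopifnot(is.data.frame(dat), nrow(dat) > 0)

  # Identify numeric columns instead of assuming every column is numeric.
  isNumeric = vapply(dat, is.numeric, logical(1))
  if (!all(isNumeric)) {
    warning("Dropping non-numeric columns: ",
            paste(names(dat)[!isNumeric], collapse = ", "))
  }

  # Nothing about the size of the data is hard-coded.
  colMeans(dat[, isNumeric, drop = FALSE], na.rm = TRUE)
}

safeColMeans(mtcars)
```

Passing a data frame with a non-numeric column, such as `iris`, would produce a warning naming the dropped column rather than an error deep inside the calculation.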
Sometimes, we need to build a better mousetrap.
This is, however, not usually the case.
Look for pre-existing code that already serves your purpose.
for(i in 1:ncol(mtcars)) {
res[i] = apply(mtcars[i], 2, function(x) {
sum(x) / nrow(mtcars[i])
})
}
colMeans(mtcars)
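Before swapping hand-written code for a built-in function, it is worth checking that the two agree. The sketch below reruns the loop (with `res` pre-allocated, an addition made here so the block runs on its own) and compares it to `colMeans()`; the object name `builtIn` is illustrative.

```r
# Loop version, with the result vector pre-allocated.
res = numeric(ncol(mtcars))
for (i in 1:ncol(mtcars)) {
  res[i] = apply(mtcars[i], 2, function(x) {
    sum(x) / nrow(mtcars[i])
  })
}

# Built-in version; unname() drops the column names so only values are compared.
builtIn = unname(colMeans(mtcars))

# all.equal() compares numeric values up to a small tolerance.
all.equal(res, builtIn)
```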
Formulated by Donald Knuth!
The guy who brought us \(\TeX\) and that great quote earlier!
At its most basic, it is the inverse of documentation.
\(\text{Documentation} = \text{Comments within code}\)
\(\text{Literate Programming} = \text{Code within exposition}\)
# Creating Raw Weight
library(dplyr)
mtcars %>%
mutate(wtRaw = wt * 1000)
# In the raw data, the "wt" variable (vehicle weight) is the actual weight
# divided by 1000. All in all, this is nothing too major. However, having
# weight represented in thousands causes increased mental processing in
# visualizations, in addition to unnecessary words to explain the variable.
# To that end, we will create a new variable within the data -- "wtRaw".
# To create "wtRaw", we will simply multiply the original "wt" variable
# by 1000. Given the simplicity and simple syntax, we will be using
# the "mutate" statement from the "dplyr" package.
library(dplyr)
mtcars %>%
mutate(wtRaw = wt * 1000)
I know exactly one person who has gone full-on literate programming.
Instead of just thinking through your code, start writing your “code thoughts” down.
We already started down the path of the why and what.
We should probably talk about the how.
Writing code is just one part of the research process.
Although it is important, that is not where reproducible research stops.
Your data and code can “live” within documents.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 11.25831 | 7.31834 | 1.53837 | 0.13604 |
complaints | 0.68242 | 0.12884 | 5.29643 | 0.00002 |
privileges | -0.10328 | 0.12935 | -0.79851 | 0.43181 |
learning | 0.23798 | 0.13941 | 1.70702 | 0.09974 |
This text would merely describe my results. I might offer my model’s equation:
\[ \begin{aligned} \widehat{\text{rating}} = {} & 11.2583051 + 0.6824165 \times \text{complaints} \\ & - 0.1032843 \times \text{privileges} \\ & + 0.2379762 \times \text{learning} \end{aligned} \]
Then, I might explain something about the “complaints” variable: Complaints is significant at p < .05.
That might have looked like just plain text.
But there is code living behind the words:
# Pick the sentence based on the p-value for "complaints".
ifelse(tidySummary$p.value[which(tidySummary$term == "complaints")] < .05,
"Complaints is significant at p < .05",
"Complaints is not significant at p < .05")
“If you are going to do something more than twice, write a function.” – MC
# Write a significance sentence for any term in a tidy model summary.
pValueWriter = function(dat, varName) {
ifelse(dat$p.value[which(dat$term == varName)] < .05,
paste(varName, " is significant at p < .05", sep = ""),
paste(varName, " is not significant at p < .05", sep = ""))
}
Copying and pasting, while not inherently bad, can lead to future pain.
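To show the function in use, here is a sketch that assumes the model was fit to R's built-in `attitude` data (the variable names and N = 30 match, but the slides do not say so explicitly). The tidy-style summary is built by hand from `coef(summary(fit))` so the block has no package dependencies; `broom::tidy(fit)` returns the same column names.

```r
# Assumption: the model on the slides uses the built-in `attitude` data.
fit = lm(rating ~ complaints + privileges + learning, data = attitude)

# Build a tidy-style summary with the columns the function expects.
coefs = coef(summary(fit))
tidySummary = data.frame(
  term      = rownames(coefs),
  estimate  = coefs[, "Estimate"],
  std.error = coefs[, "Std. Error"],
  statistic = coefs[, "t value"],
  p.value   = coefs[, "Pr(>|t|)"],
  row.names = NULL
)

pValueWriter = function(dat, varName) {
  ifelse(dat$p.value[which(dat$term == varName)] < .05,
         paste(varName, " is significant at p < .05", sep = ""),
         paste(varName, " is not significant at p < .05", sep = ""))
}

# One call per predictor instead of three pasted ifelse() statements.
sapply(c("complaints", "privileges", "learning"),
       pValueWriter, dat = tidySummary)
```

If the model changes, the sentences update with it, which is the point of letting code live behind the words.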
How do you generally create your tables?
Word?
\(\LaTeX\)?
Something else?
| | rating (1) | rating (2) |
|---|---|---|
| complaints | 0.780*** | 0.682*** |
| | (0.119) | (0.129) |
| privileges | -0.050 | -0.103 |
| | (0.130) | (0.129) |
| learning | | 0.238* |
| | | (0.139) |
| Constant | 15.328** | 11.258 |
| | (7.160) | (7.318) |
| N | 30 | 30 |
| R\(^2\) | 0.683 | 0.715 |
| Adjusted R\(^2\) | 0.660 | 0.682 |
| Residual Std. Error | 7.102 (df = 27) | 6.863 (df = 26) |
| F Statistic | 29.095*** (df = 2; 27) | 21.743*** (df = 3; 26) |

Notes: ***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.
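A table like the one above can be generated from the fitted models rather than typed by hand. The sketch below assumes the two models were fit to the built-in `attitude` data and uses the `stargazer` package if it is installed; both are assumptions, since the slides do not name the data or the tool. Without the package, it falls back to printing the coefficient tables.

```r
# Assumption: both models use the built-in `attitude` data.
m1 = lm(rating ~ complaints + privileges, data = attitude)
m2 = lm(rating ~ complaints + privileges + learning, data = attitude)

if (requireNamespace("stargazer", quietly = TRUE)) {
  # type = "text" prints to the console; "latex" and "html" are also available.
  stargazer::stargazer(m1, m2, type = "text")
} else {
  # Fallback: plain coefficient tables, rounded to three decimals.
  print(round(coef(summary(m1)), 3))
  print(round(coef(summary(m2)), 3))
}
```

Because the table is built from `m1` and `m2`, refitting either model regenerates the table with no manual retyping.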
Collaboration
More importantly – version control.
There are many Git hosting services:
GitHub
GitLab
Bitbucket
Markup language for document creation.
Its primary target is the web, but it can produce other output formats.
Document creation
Puts everything together in one tidy source.
Starting down the reproducibility path is easy.
Continuing down the path is what starts to become onerous.