Code for Publication - Level 1

Take whatever you've got, archive it online, put a link in the paper. This is low cost & provides an extended methods section where the detailed decisions you made can be found by a sufficiently motivated person. This is a major positive even if no one can successfully run the code.

Everyone's code is messy. We understand. Post your code & I will personally fight anyone who complains that it's messy.

When I say archive I mean https://zenodo.org/ or similar

Code for Publication - Level 2 (Level 1+)

Written instructions for rerunning the code to replicate the analysis (ideally in a file called README). E.g., 1) install packages A, B, & C; 2) download data D; 3) run script E; 4) manually change F; 5) run script G; etc.
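A README along these lines might look like the sketch below; the package names, file names, and the manual step are placeholders, not anything from a real project:

```text
# Reproducing the analysis

1. Install R (>= 4.3) with packages A, B, and C:
   install.packages(c("A", "B", "C"))
2. Download dataset D from the archive linked in the paper into data/.
3. Run scripts/01-clean.R.
4. Manually edit config.yml to set the study region (step F).
5. Run scripts/02-analysis.R; figures are written to output/.
```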

Ideally, go through this process yourself on a computer other than the one you did the analysis on to make sure it works. Even better, have a friend do it to make sure someone other than you can follow the steps.

Code for Publication - Level 3 (Level 1/2+)

Automate all the steps for rerunning the analysis using a script. This could be a bash script or a script in the language your code is written in. It should include package installation (ideally with fixed version numbers if the language allows it). Test this yourself on a computer other than the one you did the analysis on, have a friend test it, and try to test it on multiple operating systems. Make sure the outputs match the paper.
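The Level 3 idea can be sketched as a single driver script. Everything below is illustrative: the package pins, data URL, and script names are placeholders. By default it prints each step as a dry run, so you can inspect the pipeline before setting DRY_RUN=0 to actually execute it.

```shell
#!/usr/bin/env bash
# Hypothetical "run everything" script; the package pins, data URL, and
# script names are placeholders, not from any real project.
# Prints each step by default; set DRY_RUN=0 to execute for real.
set -euo pipefail

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# 1) Install pinned dependencies (pip shown; renv or conda work too)
run pip install "numpy==1.26.4" "pandas==2.2.2"

# 2) Fetch the external data
run curl -L -o data/raw.csv "https://example.org/dataset.csv"

# 3) Re-run the analysis scripts in order
run python scripts/01-clean.py
run python scripts/02-model.py
run python scripts/03-figures.py

echo "Compare the files in output/ against the figures and tables in the paper."
```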

For research paper code, if you get to Level 3 (a single script that reproduces your analysis including package install and any external data acquisition) you're doing great. Thank you. You've provided a clear accounting of exactly how you produced your final results. I'm happy. Others should be happy. If you do this & someone complains because it's not Level 4+ I will kindly tell them they shouldn't let the perfect be the enemy of the very good & that there are real tradeoffs going further.

Code for Publication - Level 4+ (Level 3+)

There are lots of extra things you can do to make all of this even better:

* Use a workflow system instead of script for automation
* Provide a container (e.g., Docker) with code and data
* Have your code produce either a documented version of itself or the entire paper using literate programming tools (e.g., notebooks, Rmarkdown) (I see @tpoisot already getting ahead of me here in the replies; listen to Tim, he's awesome)
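As a sketch of the container option: all of the file names and the base image below are illustrative assumptions, not from the thread. The point is that the image bakes in the code, data, and pinned environment together, so running the container re-runs the whole analysis.

```dockerfile
# Hypothetical Dockerfile bundling code, data, and environment
# for one analysis; every file name here is a placeholder.
FROM python:3.11-slim

WORKDIR /analysis

# Pin the environment inside the image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake in the code and data so the image is self-contained
COPY scripts/ scripts/
COPY data/ data/
COPY run_all.sh .

# Running the container re-runs the whole analysis
CMD ["bash", "run_all.sh"]
```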

The Level 4+ stuff is great. I love these kinds of tools. We use them daily (though not always for publishing code). But, don't let not using them stop you from publishing code at Levels 1, 2, or 3.

There are a ton of write-ups of related ideas and even more examples of them. Here are just a few for further reading:

* https://kbroman.org/steps2rr/ by @kbroman
* https://doi.org/10.1371/journal.pcbi.1005510 by @gvwilson et al.

If you're interested in research software instead of code supporting analyses (a very different situation) see @tpoisot's very nice piece as a starting point http://doi.org/10.4033/iee.2015.8.8.f (and references therein for more)


@ethanwhite @kbroman @gvwilson @tpoisot This is something I like to call "layered reproducibility" (discussion here: https://discuss.ropensci.org/t/creating-a-package-to-reproduce-an-academic-paper/1210/2).

I sometimes worry that highly formalized build systems (like targets in R) obscure what could be easier-to-follow scripts for many users. targets does make you write the code in reusable-ish functions, though.

I like your README map of steps, Tim! We like to print the graph of steps, but it is not as easy to follow: https://github.com/ecohealthalliance/mpx-diagnosis

@noamross @ethanwhite @kbroman @gvwilson @tpoisot I like the graph from the targets package - I might use it just for that!

@ethanwhite @kbroman @gvwilson @tpoisot
I appreciate the sentiment of your hierarchy.

But realistically, anything that isn't effectively a container of some sort (LXC, docker, Apptainer and so on) will not be fully reproducible by anybody ten years from now. If you're using R the timeframe is more like 3-5 years. If you're still using Python 2 you've already lost.

@jannem @ethanwhite @kbroman @gvwilson @tpoisot
I’m a bit skeptical whether you will still be able to run a container that you’ve created today in ten years from now.
Docker was released as open source just over ten years ago.
Are container formats from 10 years ago still compatible with today’s runtimes?
Also where do you store the images? Will docker-hub etc. still exist?
Will the images still be accessible?
@jannem @ethanwhite @kbroman @gvwilson @tpoisot
Sure, you can host your own container registry, but you will be moving it to new hardware at least once or twice over the next ten years.
@jannem I agree about the decay of full reproducibility, but also agree with @ostueker that I'm skeptical of even containers at the scale of a decade. However, if we want to get to the point of fully reproducible in a decade for all scholarship we're going to have to get there gradually, and so supporting partial steps in that direction is good/necessary. If it's publish a container or nothing, then for 90% of the folks in my field it will be nothing. @kbroman @gvwilson @tpoisot