Code for Publication - Level 1

Take whatever you've got, archive it online, put a link in the paper. This is low cost & provides an extended methods section where the detailed decisions you made can be found by a sufficiently motivated person. This is a major positive even if no one can successfully run the code.

Everyone's code is messy. We understand. Post your code & I will personally fight anyone who complains that it's messy.

When I say archive, I mean https://zenodo.org/ or a similar long-term repository.

Code for Publication - Level 2 (Level 1+)

Written instructions for rerunning the code to replicate the analysis (ideally in a file called README). E.g.: 1) install packages A, B, & C; 2) download data D; 3) run script E; 4) manually change F; 5) run script G; etc.
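As a sketch, a minimal README of this kind might look like the following (every package, data, and script name here is a hypothetical placeholder; the heredoc just shows one way to keep it as plain text alongside the code):

```shell
# Write a minimal step-by-step README (all names below are placeholders).
cat > README.md <<'EOF'
To reproduce the analysis:
1. Install packages: pip install numpy pandas
2. Download the data: wget https://zenodo.org/record/<ID>/files/data.csv
3. Run: python 01_clean_data.py
4. Manually set INPUT_PATH at the top of 02_analysis.py to data_clean.csv
5. Run: python 02_analysis.py
EOF
```

The point isn't the format, it's that the steps are written down in order, including the manual ones.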

Ideally, go through this process yourself on a computer other than the one you did the analysis on to make sure it works. Even better, have a friend do it to confirm that someone other than you can follow the steps.

Code for Publication - Level 3 (Level 1/2+)

Automate all the steps for rerunning the analysis using a script. This could be a bash script or a script in the language your code is written in. It should include package installation (ideally w/fixed version numbers if the language allows it). Test it yourself on a computer other than the one you did the analysis on, have a friend test it, and try to test it on multiple operating systems. Make sure the outputs match the paper.
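A Level 3 script might look something like this sketch (package versions, the Zenodo record, and script names are all placeholders; it defaults to a dry run that prints the plan, so set DRY_RUN=0 to actually execute):

```shell
#!/usr/bin/env bash
# run_all.sh -- one-shot reproduction sketch; every name below is hypothetical.
set -euo pipefail
DRY_RUN="${DRY_RUN:-1}"   # default to printing the plan only

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run pip install 'numpy==1.26.4' 'pandas==2.2.2'             # pinned versions
run wget -nc 'https://zenodo.org/record/XXXXXXX/files/data.csv'  # archived data
run python 01_clean_data.py
run python 02_fit_models.py
run python 03_make_figures.py      # then compare outputs/ to the paper
```

One command, covering install, data acquisition, and the analysis itself, is the whole idea.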

For research paper code, if you get to Level 3 (a single script that reproduces your analysis, including package installation and any external data acquisition) you're doing great. Thank you. You've provided a clear accounting of exactly how you produced your final results. I'm happy. Others should be happy. If you do this & someone complains because it's not Level 4+, I will kindly tell them they shouldn't let the perfect be the enemy of the very good & that there are real tradeoffs to going further.

Code for Publication - Level 4+ (Level 3+)

There are lots of extra things you can do to make all of this even better:

* Use a workflow system instead of script for automation
* Provide a container (e.g., Docker) with code and data
* Have your code produce either a documented version of itself or the entire paper using literate programming tools (e.g., notebooks, Rmarkdown) (I see @tpoisot already getting ahead of me here in the replies; listen to Tim, he's awesome)

The Level 4+ stuff is great. I love these kinds of tools. We use them daily (though not always for publishing code). But, don't let not using them stop you from publishing code at Levels 1, 2, or 3.

There are lots of write-ups of related ideas and even more examples of them. Here are just a few for further reading:

* https://kbroman.org/steps2rr/ by @kbroman
* https://doi.org/10.1371/journal.pcbi.1005510 by @gvwilson et al.

If you're interested in research software rather than code supporting analyses (a very different situation), see @tpoisot's very nice piece as a starting point: http://doi.org/10.4033/iee.2015.8.8.f (and references therein for more)


@ethanwhite @kbroman @gvwilson @tpoisot This is something I like to call "layered reproducibility" (discussion here: https://discuss.ropensci.org/t/creating-a-package-to-reproduce-an-academic-paper/1210/2).

I sometimes worry that highly formalized build systems (like targets in R) obscure what could be easier-to-follow scripts for many users. Targets does make you write the code in reusable-ish functions, though.

I like your README map of steps, Tim! We like to print the graph of steps, but it is not as easy to follow: https://github.com/ecohealthalliance/mpx-diagnosis

Creating a package to reproduce an academic paper

I’d say using Docker or a similar environment is the best approach, especially if you have external system dependencies. That said, I try to aim for layered reproducibility, such that you maximize the reproducibility of the project within successive tools. Making an R package is a very good way of wrapping up the pieces of the project for re-use, with dependencies specified in the DESCRIPTION. This has its limits, as you mention above. So I prefer to also have a Dockerfile that creates the envi...

@noamross @ethanwhite @kbroman @gvwilson @tpoisot I like the graph from the targets package - I might use it just for that!

@ethanwhite @kbroman @gvwilson @tpoisot
I appreciate the sentiment of your hierarchy.

But realistically, anything that isn't effectively a container of some sort (LXC, docker, Apptainer and so on) will not be fully reproducible by anybody ten years from now. If you're using R the timeframe is more like 3-5 years. If you're still using Python 2 you've already lost.

@jannem @ethanwhite @kbroman @gvwilson @tpoisot
I’m a bit skeptical whether you will still be able to run a container that you’ve created today ten years from now.
Docker was released as open source just over ten years ago.
Are container formats from 10 years ago still compatible with today’s runtimes?
Also where do you store the images? Will docker-hub etc. still exist?
Will the images still be accessible?
@jannem @ethanwhite @kbroman @gvwilson @tpoisot
Sure, you can host your own container registry, but you will be moving it to new hardware at least once or twice over the next ten years.
@jannem I agree about the decay of full reproducibility, but also agree with @ostueker that I'm skeptical of even containers at the scale of a decade. However, if we want to get to the point of fully reproducible-in-a-decade for all scholarship, we're going to have to get there gradually, and so supporting partial steps in that direction is good/necessary. If it's publish a container or nothing, then for 90% of the folks in my field it will be nothing. @kbroman @gvwilson @tpoisot
@ethanwhite we've gotten a lot of value out of also including in the container an interface like RStudio Server (rocker) or JupyterHub/Binder for reviewer/reader/student exploration in a browser. Makes it 'heavier' but increases convenience when re-engaging with a project. e.g. https://github.com/ZimmermanLab/SF-metrosideros-endophytes
GitHub - ZimmermanLab/SF-metrosideros-endophytes: Data, code, and manuscript text for an article characterizing the fungal endophytic communities of Metrosideros excelsa across the city of San Francisco.

@ethanwhite the nice thing about a setup like this as an instructor/mentor is then you can spin up an instance on AWS or Digital Ocean, start the docker container, then just point a student to the droplet url with the RStudio server port. Zero config for them to start engagement with the project even if it's been years since they last took a coding class and installed R.
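For anyone who hasn't seen this pattern, the server side is roughly one command. A sketch (the IP and password are placeholders; rocker/rstudio is the Rocker project's RStudio Server image; the command is echoed here rather than executed, since running it needs Docker and a server):

```shell
DROPLET_IP="203.0.113.7"   # placeholder address for the cloud server
# Start RStudio Server in a container, exposed on its default port 8787:
SERVER_CMD="docker run -d -p 8787:8787 -e PASSWORD=classpass rocker/rstudio"
echo "on the server: $SERVER_CMD"
echo "student opens: http://$DROPLET_IP:8787"
```

The student only ever sees the URL; all the environment setup stays on the server side.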
@ethanwhite and shoutout to the awesome Binder project for folks who don't know about it yet https://mybinder.org/

@naupaka good call, mybinder is definitely a big deal in this space
@naupaka nice! I really like it. Lots of technical hurdles for folks getting started, but definitely a great example of Level 4+ tools
@ethanwhite haha yup. Gotta love step one being 'install docker and git'. Definitely some 'draw the owl' vibes. I am completely in agreement with you that README files and some scripts in an archive are a fabulous place to start.
@ethanwhite I have never learned how to use containers (I am still processing the trauma of pyenv, in my defense). Do you find the overhead worth it for code that isn't going to be widely reused?
@tpoisot @ethanwhite I do -- it's fast/easy if you start with a good base. I make huge use of parameterized docker containers for teaching in most of my classes. For me the selling point is getting any project up to speed on any server basically instantly. And, for example having a loop take a csv to make environments for 50 students at once. I'm not using them as they were intended I suppose, but I think they're pretty great as lightweight VMs (vs something slightly heavier like vagrant)
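The csv-to-classroom loop might be sketched like this (the roster contents, image, and ports are all hypothetical; the docker commands are collected and printed rather than executed, since running them needs Docker and a server):

```shell
printf 'ada,pw1\ngrace,pw2\n' > students.csv   # hypothetical roster: name,password
port=8787
cmds=""
while IFS=, read -r student pass; do
  # one container per student, each mapped to its own host port
  cmds="${cmds}docker run -d --name rstudio-$student -p $port:8787 -e PASSWORD=$pass rocker/rstudio"$'\n'
  port=$((port + 1))
done < students.csv
printf '%s' "$cmds"
```

Fifty students is just a longer csv; the loop doesn't change.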
Docker for Teaching

Slides for Carpentries Skill Share on using Docker for Teaching held on May 15, 2019.

@naupaka @ethanwhite this is brilliant, and will in no way be supported by our local IT...
@tpoisot @ethanwhite depending on computational needs, easy to do on cloud compute as well. but yeah, need lots of open ports (can be only open inside campus firewall if students use vpn) and root on a large-ish server. Singularity (https://docs.sylabs.io/guides/3.5/user-guide/introduction.html) might be more palatable to IT than docker, but you still need open ports
@naupaka Singularity + inside the campus firewall is the exact combination that makes it possible for us to convince our IT folks to let us do things in this space @tpoisot
@tpoisot for paper code (which we don't expect to be widely reused, otherwise we'd package it) we've moved away from containers. I think the likelihood of anyone spinning them up is low and the code mostly serves the "extended documentation" role anyway. Much of their value also assumes long-term public hosting, and with recent goings-on at Docker I'm not sure that's a safe assumption.
@ethanwhite @tpoisot I think if you have a strong interest in a container working long-term you should store the built container rather than the Dockerfile. We should probably template up the workflow of depositing the container binary with code in a Zenodo-like repository.
@noamross agreed. that would definitely be handy. @tpoisot

@ethanwhite @tpoisot

In #Maneage [1], level 4+ is different:

* in analysis/ we use 'make' for the higher-level workflow, encouraging bash scripts for details;

* in software/ we use 'make' to build all the software with sha512sum checks on the downloads, starting from a minimal unix-like system;

* the makefiles initialize.mk and paper.mk are the workflow for the paper

Fully reproduce:
./project configure
./project make

Example: [2]

[1] https://maneage.org
[2] https://zenodo.org/record/7792910


@boud very nice! @tpoisot
@ethanwhite @boud Oh yeah, data semantics and provenance breaks my brain in ways that code can't come close to -- I'm definitely going to have a look at maneage!

@tpoisot

Cool! :) The main 'tasks' and 'bugs' are coordinated at savannah [1]; we have some loosely organised irc/matrix channels e.g. [2].

To test a real science paper (peer-reviewed), not too heavy computationally, I recommend [3] (a not-quite-final version run fully from scratch on a PinePhone).

[1] https://savannah.nongnu.org/support/?func=additem&group=reproduce

[2] irc: https://libera.chat ##maneage
matrix bridge: #maneage-community:matrix.org

[3] https://arxiv.org/abs/2112.14174 = https://zenodo.org/record/6794222

@ethanwhite


@ethanwhite something we've done for a paper was to write all the scripts using literate programming, so that at the end the code became the readme, and then every step became its own notebook.

https://github.com/PoisotLab/MetawebTransferLearning

This was not that much extra work, and it also resulted in code that was far more useful to the collaborators not involved in the programming steps. Strongly recommend it.

@tpoisot @ethanwhite agreed. We did the same for a recent paper https://UZH-PEG.github.io/diversity_envresp1/ and it saved us a lot of work when we had to re-run the analysis to make reviewers happy.