Research papers increasingly refer to websites like GitHub for code, which may be a problem...

About 20% of preprints referred to GitHub in 2021. Yet, these web pages are often not archived, jeopardizing their long-term availability.

"The growing use of Git Hosting Platforms in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code"

via https://doi.org/10.48550/arXiv.2208.04895
#OpenScience #OpenSource @academicchatter

The Rise of GitHub in Scholarly Publications

The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, the Software Heritage Foundation is working to archive public source code, but there is value in archiving the issue threads, pull requests, and wikis that provide important context to the code while maintaining their original URLs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To understand and quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora from January 2007 to December 2021. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.66 million publications in the corpora. We found that GitHub, GitLab, SourceForge, and Bitbucket were collectively linked to 160 times in 2007 and 76,746 times in 2021. In 2021, one out of five publications in the arXiv corpus included a URI to GitHub. The complexity of GHPs like GitHub is not amenable to conventional Web archiving techniques. Therefore, the growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.
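The URI analysis the abstract describes could be approximated with a simple extraction pass over publication text. A minimal sketch, assuming a loose URL regex and substring matching on the four platform hosts (not the authors' actual method or corpus tooling):

```python
import re
from collections import Counter

# Hosts of the four Git Hosting Platforms named in the abstract.
GHP_HOSTS = ("github.com", "gitlab.com", "bitbucket.org", "sourceforge.net")

# Loose URL pattern: stops at whitespace and common closing punctuation.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def count_ghp_uris(text):
    """Count URIs pointing at each Git Hosting Platform in a text."""
    counts = Counter()
    for url in URL_RE.findall(text):
        for host in GHP_HOSTS:
            if host in url.lower():
                counts[host] += 1
                break
    return counts

sample = "Code at https://github.com/example/repo and https://gitlab.com/a/b"
print(count_ghp_uris(sample))
```

Run over each paper's full text and summed per year, counts like these would yield the trend figures quoted above (160 links in 2007 vs. 76,746 in 2021).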

@erinnacland @academicchatter If I were to guess, I'd say academic institutions are at most about as reliable as, and most likely less reliable than, a company like GitHub at keeping anything (code, data, methods), because long-term funding for academic research is largely non-existent.
@erinnacland @academicchatter While archives of code sites are needed, Computer Science publications have forever been guilty of an even greater problem: publishing claims without any documentation of the code used to arrive at them. Most published CS articles without such code are useless in terms of reproducibility or enabling incremental progress. So please get the criticism the right way around: we need more published code, not less!
@erinnacland @academicchatter To sharpen this a bit: The problem is not in CS publications that refer to code that does not yet have sufficiently well-thought-out long-term archives. The problem is in CS and other publications for which links and adequate references to code and data don't exist at all!
@erinnacland @academicchatter @float13 wait, there are academics who post code? That's amazing! We definitely do need to archive them, but still.

@erinnacland
That’s a real worry. I included a long appendix in my Master's thesis with pseudocode, along with a CD and thumb drive of the actual code when I lodged it. But that’s obviously problematic, too.

@academicchatter @kcarruthers