NEW Preprint alert! Policymakers, medics, data analysts, industry IP lawyers, open research folks, this is one for you: COVID-19: An exploration of consecutive systemic barriers to pathogen-related data sharing during a pandemic.

https://doi.org/10.48550/arXiv.2205.12098

🧵 - boosts appreciated!

In 2020, the COVID-19 pandemic resulted in a rapid response from governments and researchers worldwide. As of May 2022, over 6 million people died as a result of COVID-19 and over 500 million confirmed cases, with many COVID-19 survivors going on to experience long-term effects weeks, months, or years after their illness. Despite this staggering toll, those who work with pandemic-relevant data often face significant systemic barriers to accessing, sharing or re-using this data. In this paper we report results of a study, where we interviewed data professionals working with COVID-19-relevant data types including social media, mobility, viral genome, testing, infection, hospital admission, and deaths. These data types are variously used for pandemic spread modelling, healthcare system strain awareness, and devising therapeutic treatments for COVID-19. Barriers to data access, sharing and re-use include the cost of access to data (primarily certain healthcare sources and mobility data from mobile phone carriers), human throughput bottlenecks, unclear pathways to request access to data, unnecessarily strict access controls and data re-use policies, unclear data provenance, inability to link separate data sources that could collectively create a more complete picture, poor adherence to metadata standards, and a lack of computer-suitable data formats.

This is a qualitative interview study: researchers, data scientists and software engineers, civic data specialists, medics, and pandemic modellers talked about their experiences of, and barriers to, accessing, using, and sharing COVID-19 data.

As we all know, pandemics and epidemics present an urgency for knowledge not found in "peace" times - and every subsequent barrier means delayed response, missed waves, deaths, and long-term disabilities that could have been prevented. /2

We take a very broad view of what data is relevant here: it's not just infections, hospitalisations, deaths, and vaccines, but also genomes (virus genomes, patient genomes), mobility data, geographic regions, and even movement restrictions.

All in all we found five categories of barriers around pathogen-related data sharing. 1. Knowing data exists, 2. accessing that data, 3. using the data once accessed, 4. re-sharing your analyses, and 5. human throughput. /3

Barriers 1 - 4 are often experienced sequentially, but that 5th barrier, human throughput, is woven throughout the other 4. Many of the barriers are unsurprising - we know that data sharing could be better.

Accessing data may require friends in high places who can grant you access or expedite your request. You might not be able to afford to pay for data and need to apply for a grant first. Or just apply for access and hope the money appears?

/4

Digging into use barriers a little more, we note that they can include unreliable and untrustworthy data, disappearing or corrupt data, copy-pasting data from websites, PDFs, and even graphics (sob), and restrictive licence requirements.

Don't get me started on file formats! Apart from the notorious UK Excel bug https://bbc.co.uk/news/technology-54423988 - so so so many people talked about lack of compliance with data standards, hidden Excel columns, downloading from databases and manually annotated files... /5

There's far more than I could possibly include in a tweet thread! One of the _most_ interesting findings I've had so far was that we need _temporal_ metadata, especially around geographical regions and mixing laws. I'll explain:

Lockdowns, masking, "no more than 6 in a group", etc. - usually it was possible to find out what rules were in place _today_, but it was MUCH harder to find out what the laws were, say, three months ago. BUT... /6

When a data scientist is modelling or explaining a spike or lull in pathogen spread, knowing the human mixing safety rules at the time is imperative! Ideally, laws would be machine-readable to feed automatically into computational models. /7

Another thing. Pathogen spread is SO political! Multiple people from multiple countries told me about data that made the government's grasp of the situation look bad that, ah, "disappeared". How can we follow the science if the governments are complicit in hiding information? /8

I'll wrap up my thread here, but leave a reminder: _planning_ and creating flexible data infrastructures and data champions is imperative. Millions have died, but with better information sharing we can create informed policy and put fewer people at risk.

The best time to plant that tree (appropriate pathogen data sharing infrastructure) was before the pandemic of course, but the next-best time to plant it is today. /End thread

Or a toot thread 🙈