Excited to share my latest preprint (with a stellar band of collaborators), where we map the VCF data model into an efficient, cloud-native storage format. Thread follows:
https://www.biorxiv.org/content/10.1101/2024.06.11.598241v1
VCF is in many ways a tremendous success, providing a single channel through which all kinds of genomics data flows, with (mostly) good interoperability. It is a great archival format. However, it is not an efficient basis for computation, particularly at Biobank scale.
It is the *row-wise* storage of data used by VCF (and most of its proposed alternatives, including BCF) that is most fundamentally limiting. It is not possible to efficiently extract a particular field or sample from row-wise variant stores.
We propose an alternative storage approach for variation data based on the widely used Zarr standard (https://zarr.dev). Rather than grouping all data for a given variant together, we group all data for a given field, and store them as chunked, compressed N-D arrays (tensors).
This yields excellent compression performance. Here is a benchmark based on (very realistic) simulations where we compare the Zarr-based approach with VCF, BCF and two state-of-the-art methods. Remarkably, Zarr's simple approach does almost as well as Savvy!
Compression isn't everything though - we also want to *compute* with our data. Here is a benchmark in which we perform a simple calculation over the whole genotype matrix (see text for rationale), essentially comparing the computational accessibility of the formats.
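A toy stand-in for this kind of whole-matrix computation (not the paper's exact benchmark): count alternate alleles per variant, streaming the genotype matrix through memory one chunk of variants at a time, the way chunked storage allows:

```python
import numpy as np

# Illustrative matrix: variants x samples x ploidy.
rng = np.random.default_rng(1)
G = rng.integers(0, 2, size=(10_000, 50, 2), dtype=np.int8)

chunk = 1_000
alt_count = np.empty(G.shape[0], dtype=np.int64)
for start in range(0, G.shape[0], chunk):
    stop = start + chunk
    # Each iteration only needs one chunk of variants in memory.
    alt_count[start:stop] = G[start:stop].sum(axis=(1, 2), dtype=np.int64)
```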
Where Zarr really starts to shine is when we are interested in *subsets* of the data. By storing fields separately, and by storing the data in each field as a regular grid of compressed chunks, subsetting is much more efficient. Here is the same benchmark on a small sub-matrix.
Extracting individual (1-D) fields is even more extreme. Here we benchmark extracting the POS field and writing to a text file: 21,418 seconds with bcftools on a BCF file, vs 5 seconds using Zarr and Python.
But, does this work on real data? Yes! To demonstrate how well Zarr performs on real data with many FORMAT fields we converted chr2 of the Genomics England aggv2 dataset. Overall, we see a 5X reduction in storage compared to the original (12.81 TiB across 106 vcf.gz files).
Zarr is currently used to store multiple petabyte scale scientific datasets (https://zarr.dev/datasets/) and with multiple implementations (https://zarr.dev/implementations/…). It is cloud-native, with first-class support for object stores like S3. It scales.
We provide the draft VCF Zarr specification, which formalises the mapping from VCF to Zarr. While we're confident it captures the vast majority of use-cases, there are probably still lots of details that need working out. Feedback and contributions welcome!
We also provide the vcf2zarr converter, as part of the fledgling bio2zarr package. It supports both parallel and distributed conversion, and can handle very large datasets. It could also be improved in many ways - feedback and contributions welcome!
https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html
We hope that these tools can provide the starting point for a new generation of tools that process genetic variation data. The VCF Zarr spec should provide a stable platform for methods developers, who can enjoy efficient, scalable access to data.
One necessary piece of infrastructure that does not yet exist is a "vcztools" package that implements some of the read-only functionality of bcftools. This could provide compatibility with existing workflows, allowing a cloud-based Zarr store to have file-like semantics.
If you are interested in this, or any other aspect of the work, please do get in contact! VCF Zarr has lots of potential, but this can only be realised if it is widely adopted. Efficient, FAIR access to VCF data *is* possible, but only with a concerted, community effort.