But does this work on real data? Yes! To demonstrate how well Zarr performs on real data with many FORMAT fields we converted chr2 of the Genomics England aggV2 dataset. Overall, we see a 5X reduction in storage compared to the original (12.81 TiB across 106 vcf.gz files).
Extracting individual (1-D) fields is even more extreme. Here we benchmark extracting the POS field and writing to a text file: 21,418 seconds with bcftools on a BCF file, vs 5 seconds using Zarr and Python.
Where Zarr really starts to shine is when we are interested in *subsets* of the data. By storing fields separately, and by storing the data in each field as a regular grid of compressed chunks, subsetting is much more efficient. Here is the same benchmark on a small sub-matrix.
Compression isn't everything though - we also want to *compute* with our data. Here is a benchmark in which we perform a simple calculation over the whole genotype matrix (see text for rationale), essentially comparing the computational accessibility of the formats.
This yields excellent compression performance. Here is a benchmark based on (very realistic) simulations where we compare the Zarr-based approach with VCF, BCF and two state-of-the-art methods. Remarkably, Zarr's simple approach does almost as well as Savvy!
We propose an alternative storage approach for variation data based on the widely used Zarr standard (https://zarr.dev). Rather than grouping all data for a given variant together, we group all data for a given field, and store it as chunked, compressed N-D arrays (tensors).

Have beautiful data in Zarr? Show us on Bluesky!
We can characterise the origin of recombinants very precisely. Here are some subgraphs showing the origins of Pango X lineages and the clustering of samples (based on Nextclade Pango lineage assignments).
The whole point of an ARG is to account for recombination, though - so how do we do on that front? As far as we can tell, very well. For example, we compare with the early recombinants detected by Jackson et al., with near-perfect agreement:
The method works quite well, despite the simplicity of the tree-building model and the reliance on parsimony heuristics. Here is a comparison of a "backbone" phylogeny with a Nextstrain tree. While there is clearly room for improvement, we're mostly doing well.
Building on top of this existing library and package infrastructure has huge advantages. For example, here's some code where we load the Wide ARG and simulate 1.4 million mutations under the Felsenstein 84 model using msprime (params arbitrary). This takes 2.5 seconds.