Vince Buffalo Oct 2, 2023

As a prototype, I have built a SciDataFlow Asset for the NYGC high-coverage 1000 Genomes data. You can see it here: https://github.com/scidataflow-assets/nygc_gatk_1000G_highcov

In just two lines, you can fetch a Data Manifest from SciDataFlow-Assets and download all of the 1000 Genomes data concurrently.
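
The two commands themselves are not reproduced in this feed. A minimal sketch of what they likely look like, assuming the SciDataFlow command-line tool is installed as `sdf` and that its `asset` and `pull --all` subcommands work as described in the SciDataFlow README. The wrapper function `fetch_1000g` is hypothetical and is only there so the snippet fails cleanly when `sdf` is missing:

```shell
# Hypothetical wrapper; the two `sdf` lines are the actual retrieval steps.
fetch_1000g() {
  # Assumption: the SciDataFlow CLI is installed and on PATH as `sdf`.
  if ! command -v sdf >/dev/null 2>&1; then
    echo "sdf not found; install SciDataFlow first" >&2
    return 127
  fi
  # 1. Fetch the Data Manifest for this asset from SciDataFlow-Assets.
  sdf asset nygc_gatk_1000G_highcov || return
  # 2. Download every file listed in the manifest.
  sdf pull --all
}
```

Calling `fetch_1000g` from a project directory would start the download. The full call set is large, so it is worth checking available disk space first.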

Vince Buffalo Oct 2, 2023

Writing and sharing a Data Manifest = making your scientific data an asset.

Please contribute, and I welcome any feedback!

Vince Buffalo Oct 2, 2023

Because SciDataFlow's Data Manifest serves as a minimal recipe for retrieving and sharing data, it makes downloading data and incorporating it into your work nearly effortless.

SciDataFlow-Assets is a community-led effort to build these recipes for core datasets.

https://github.com/scidataflow-assets

Vince Buffalo Oct 2, 2023

The data produced by a project is in essence a scientific "asset". Yet, all too often these data assets are lost and/or cannot be easily reused by others. We need to change this!

Vince Buffalo Oct 2, 2023

Effective science isn't about a final publication; it's about the availability of data generated by research for reanalysis and reuse.

A healthy scientific workflow should make it trivial to incorporate prior data into your work.

Enter SciDataFlow's new simple feature: Assets⬇️

Vince Buffalo Sep 8, 2023

However, classic background selection (BGS) theory (black line) is quite inaccurate when mutations are only weakly selected against (points are true values). This could affect our model estimates. We use a whole new theoretical approach that works under weak selection (colored lines).

Vince Buffalo Sep 8, 2023

Previous work established that BGS is the dominant process generating large-scale patterns in genetic variability across chromosomes in humans. This signal is shaped by the spatial distribution of conserved regions and recombination rates along the genome.

Vince Buffalo Sep 8, 2023

Our simulations show our approach to interference does lead to more accurate predictions of genetic diversity. This is suggestive evidence that interference could be occurring in humans, but further work is needed. Overall, we still have a lot to learn about selection in humans!

Vince Buffalo Sep 8, 2023

By extending our method to approximate how selection in one region can impact selection in others ("selective interference") and refitting everything, we find this model fits as well and brings substitution rates into agreement with divergence levels (blue range in image above).

Vince Buffalo Sep 8, 2023

But there is a problem: since our method also predicts substitution rates, we can compare these to observed divergence across features (teal and green ranges). We find that our method (like previous BGS approaches) predicts far too low a substitution rate for very conserved regions.

Website	https://vincebuffalo.com/
Google Scholar	https://scholar.google.com/citations?user=7w_tyXUAAAAJ&hl=en
Twitter	@vsbuffalo