I’m looking for some *really big* (ideally millions of rows) biological datasets for a “Data Science in Biology” course.

Ideally they should:

* be archived with a DOI
* have an associated paper or two, with some cool questions
* be messy observational data, or data collated across many studies

If you have any pointers, I’d be extremely grateful! Please boost!

@RobLanfear There are massive amounts of distribution data in GBIF, which also assigns DOIs to the datasets used by the thousands of published papers that draw on GBIF data: https://www.gbif.org/
GBIF

Global Biodiversity Information Facility. Free and Open Access to Biodiversity Data.

@RobLanfear I can deliver partially. DOI yes, paper yes, open questions, yes. But not messy. I can also help craft interesting questions. https://www.nature.com/articles/s41597-022-01179-8
SkewDB, a comprehensive database of GC and 10 other skews for over 30,000 chromosomes and plasmids - Scientific Data

Measurement(s): Imbalances in the use of DNA nucleotides
Technology Type(s): Next Generation Sequencing
Factor Type(s): Position within DNA sequence • Organism type
Sample Characteristic - Organism: bacterium • archaea
Sample Characteristic - Environment: Varying
Sample Characteristic - Location: World

@RobLanfear
You might check out the Drosophila Evolution over Space and Time (DEST) dataset that we put together. It is a large pool-seq population genomic dataset for flies; it contains spatial and temporal samples and organized metadata, and is easily accessible in a variety of formats.

https://academic.oup.com/mbe/article/38/12/5782/6361628
https://dest.bio

Drosophila Evolution over Space and Time (DEST): A New Population Genomics Resource

Abstract. Drosophila melanogaster is a leading model in population genetics and genomics, and a growing number of whole-genome data sets from natural population

@RobLanfear
GeneNetwork by Rob Williams @pjotrp et al.

https://genenetwork.org/intro

Introduction GeneNetwork 2

@RobLanfear Single-cell RNA-sequencing datasets? Individual datasets may not be as big as what you’re looking for (though the Fly Cell Atlas, for example, already has data for ~550,000 cells × ~14,000 genes), but if you combine datasets from several studies, you can easily get to several million cells (with the added bonus of messiness).

The EBI’s Single Cell Expression Atlas currently has data for ~8.5 million cells.
Home < Single Cell Expression Atlas < EMBL-EBI

EMBL-EBI Single Cell Expression Atlas, an open public repository of single cell gene expression data

marina alberti (@[email protected])

The global spectrum of plant form and function: enhanced species-level trait dataset

A new dataset (https://www.nature.com/articles/s41597-022-01774-9) published in Nature Scientific Data by Sandra Díaz et al. provides species mean values for six key plant traits that define the primary axes of variation in plant form and function. The dataset, which covers 46,047 species, is based on >1 million trait records collected via the TRY database, representing ca. 2,500 original publications. #SpeciesTraits #Plants

@RobLanfear Have you already tried r/datasets on Reddit?
@RobLanfear If you want behavioural data (x,y coordinates of motion in flies) I can point you to our published datasets. It's hundreds of GBs in sqlite3 format.
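
For students facing a tracking database too large to load into memory, a minimal sketch of streaming rows straight from SQLite may help. The table name `tracking` and the columns `t`, `x`, `y` are hypothetical, not the schema of the datasets mentioned above; the demo builds a tiny in-memory database so that it runs anywhere.

```python
import math
import sqlite3


def path_length(conn, table="tracking"):
    """Sum the step distances of a trajectory, reading one row at a time."""
    # NOTE: the table and column names (t, x, y) are hypothetical.
    cursor = conn.execute(f"SELECT x, y FROM {table} ORDER BY t")
    total = 0.0
    prev = None
    for x, y in cursor:  # the cursor yields rows lazily
        if prev is not None:
            total += math.hypot(x - prev[0], y - prev[1])
        prev = (x, y)
    return total


# Tiny made-up trajectory in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracking (t INTEGER, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO tracking VALUES (?, ?, ?)",
    [(0, 0.0, 0.0), (1, 3.0, 4.0), (2, 3.0, 10.0)],
)
total = path_length(conn)
print(total)
```

On a real file you would pass `sqlite3.connect("path/to/file.db")` instead. Note that `ORDER BY t` can be slow on a very large table unless the time column is indexed or the rows are already stored in time order.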
@giorgiogilestro @RobLanfear There is quite a lot of similar data for the fly DGRP lines, so you could go on a bit of a fishing trip looking for correlations between different traits: http://dgrp2.gnets.ncsu.edu/data.html
Data

@RobLanfear We've been working a lot with this public dataset of 661K bacterial genomes:

https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001421

Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

This study presents the first uniformly assembled, comprehensively described and searchable dataset of 661,405 bacterial genomes; this resource will empower more scientists to harness the multitude of data in public sequencing archives, but also reveals the biased composition of these archives, with 90% of the data originating from just 20 species.
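
The kind of concentration described above (most of the data coming from a handful of species) is easy to measure once each genome has a species label. A minimal sketch, assuming a plain list of labels; the helper name `species_needed` and the toy labels are made up for illustration and are not taken from the 661K dataset.

```python
from collections import Counter


def species_needed(labels, share=0.9):
    """Return how many of the most common species cover `share` of all records."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank
    return 0  # no records at all


# Toy labels (made up): four species with very uneven sampling.
labels = ["sp_A"] * 50 + ["sp_B"] * 30 + ["sp_C"] * 15 + ["sp_D"] * 5
n_species = species_needed(labels)
print(n_species)
```

Plotting the cumulative share against species rank is a natural next step for a class exercise on sampling bias.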

@RobLanfear
Another vote for GBIF (AKA the Global #Biodiversity Information Facility) - find them in the fediverse at @gbif
If you're particularly looking for messy data, you can examine the issues attached to each record, which flag problems like flipped coordinates. All datasets are assigned DOIs, and custom downloads get DOIs too. Cited uses are available for exploration here: https://www.gbif.org/resource/search?contentType=literature&literatureType=journal&relevance=GBIF_USED&peerReview=true
There's also an #OpenData Ambassadors scheme, with people listed here: https://www.gbif.org/composition/6iHKXo8pUyRPJ2Ut0683Z8/ambassadors
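
To give a flavour of working with the per-record issue flags, here is a minimal sketch that tallies them. It assumes records shaped like the JSON results of GBIF's occurrence search, where each record carries an `issues` list; the helper name `tally_issues` and the three sample records are made up for illustration and are not real GBIF data.

```python
from collections import Counter


def tally_issues(records):
    """Count how often each issue flag appears across occurrence records."""
    counts = Counter()
    for record in records:
        # Records without an "issues" field are treated as having none.
        counts.update(record.get("issues", []))
    return counts


# Made-up sample records, shaped like occurrence search results.
sample = [
    {"key": 1, "issues": ["COORDINATE_ROUNDED", "PRESUMED_SWAPPED_COORDINATE"]},
    {"key": 2, "issues": ["COORDINATE_ROUNDED"]},
    {"key": 3, "issues": []},
]

issue_counts = tally_issues(sample)
for flag, n in issue_counts.most_common():
    print(flag, n)
```

For real data, the same function could be fed the `results` list from the occurrence search endpoint (https://api.gbif.org/v1/occurrence/search) or the rows of a download; check the current API documentation for paging limits before relying on it in class.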
Resources

Search for resources in Global Biodiversity Information Facility. Free and Open Access to Biodiversity Data.

@RobLanfear www.fathomnet.org. It will be archived early next year.

@RobLanfear

I don't know what kind of biology data you are looking for, but there are two large ecology repositories that may be of interest.

Check https://www.movebank.org, and specifically its data repository, for data that are published with papers: https://www.movebank.org/cms/movebank-content/data-repository.

Another is https://www.gbif.org/data, produced by @gbif.

Movebank

@RobLanfear @kakanikatija You might find some suitable datasets on the AWS Open Data site (disclosure: I work for AWS and haven’t been in a lab for many years 😁). While I don’t see DOIs associated with many of the datasets I’ve browsed through, most do have citations and all are free to use. The Genome Aggregation Database and Tabula Muris might suit your needs.

https://aws.amazon.com/opendata/

Open Data on AWS

Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition. Browse available data and learn how to register your own datasets.

@RobLanfear We've definitely got stuff at that scale in the Cell Painting Gallery! https://registry.opendata.aws/cellpainting-gallery/
Cell Painting Gallery - Registry of Open Data on AWS

@RobLanfear There's a whole bunch of data at GeneNetwork.org. The majority is mouse (especially the BXD family), but there are various species and data types.
@RobLanfear You might try iNaturalist. It's huge (tens of millions of entries), has a few thousand papers based on it, is really messy in a lot of ways, and your students can add their own datapoints.
https://dx.doi.org/10.15468/ab3s5x
iNaturalist Research-grade Observations

Observations from iNaturalist.org, an online social network of people sharing biodiversity information to help each other learn about nature. Observations included in this archive met the following requirements: * Published under one of the following licenses or waivers: 1) http://creativecommons.org/publicdomain/zero/1.0/, 2) http://creativecommons.org/licenses/by/4.0/, 3) http://creativecommons.org/licenses/by-nc/4.0/ * Achieved one of following iNaturalist quality grades: Research * Created on or before 2025-06-03 15:00:33 -0700 You can view observations meeting these requirements at https:…

@RobLanfear maybe my colleague @itchyshin will see this and be able to help.
@RobLanfear lots of genome size and chromosome numbers to compare and pull out trends from at https://www.genomesize.com/
Animal Genome Size Database:: Home

@RobLanfear Flow Cytometry data typically has millions of rows. Here's a website with lots of public datasets, including manuscript links:

https://flowrepository.org/public_experiment_representations

FlowRepository

FlowRepository is a public database of flow cytometry experiments where you can query and download data collected and annotated according to the MIFlowCyt standard. It supports storage, annotation, analysis, and sharing of flow cytometry datasets.