Data Descriptor: Coronavirus research topics, tracking twenty years of research. “To explore research trends and innovations in this space, we developed a pipeline using natural language processing techniques. This pipeline systematically catalogues and synthesises the vast array of research articles, leading to the creation of a dataset with more than eight hundred thousand articles from […]

https://rbfirehose.com/2025/06/27/data-descriptor-coronavirus-research-topics-tracking-twenty-years-of-research/

ResearchBuzz: Firehose | Individual posts from ResearchBuzz

Gallaudet News: Gallaudet experts drive accessibility of speech tech for deaf voices. “Some people use their voices to control tech, from cell phones and remote controls to home appliances and in transportation. Voice command capabilities are made possible through training AI and machine learning. The Speech Accessibility Project is creating datasets of more diverse speech patterns, which […]

https://rbfirehose.com/2025/06/27/gallaudet-news-gallaudet-experts-drive-accessibility-of-speech-tech-for-deaf-voices/

Howard University: Howard University and Google Research Enhance A.I. Speech Recognition of African American English. “Researchers collected 600 hours of data from users of different [African American English] dialects in an effort to address implicit barriers to improving [automatic speech recognition] performance. Thirty-two states are represented in the dataset.”

https://rbfirehose.com/2025/06/26/howard-university-howard-university-and-google-research-enhance-a-i-speech-recognition-of-african-american-english/

Data Rescue Project: Data Rescue Project Launches New Portal. “The Data Rescue Project (DRP) is excited to announce the launch of the DRP Portal—a milestone in our collective effort to protect and preserve at-risk public information. … The Portal makes it easy to discover rescued datasets by government offices sharing the data, topic, and more.”

https://rbfirehose.com/2025/06/25/data-rescue-project-data-rescue-project-launches-new-portal/

My prediction is that we won’t ever get public release of early OpenAI, Google, or even Anthropic #training #datasets.

Why? There are too many rich hard-right conservative backers who need all the misogyny, racism & hate speech to stay there.

We could have just & equal #AI, but we won’t. There’s too much money & power to be made off injustice.

Scientific Data: City-Defined Neighborhood Boundaries in the United States. “Researchers lack widespread but locally-sourced data on neighborhoods, and instead often adopt widely available but arbitrary Census geographies as neighborhood proxies. … We address this tension between scale and precision by collecting, cleaning, and providing to researchers a new dataset of city-defined […]

https://rbfirehose.com/2025/06/21/scientific-data-city-defined-neighborhood-boundaries-in-the-united-states/

Data Rescue Project: Why We’re Starting a New Federal Data Forum. “[Population Reference Bureau] recently launched the Federal Data Forum—a centralized online community designed to unite public data stakeholders in defense of America’s statistical infrastructure. The initiative builds on PRB’s previous work as data intermediaries, including our American Community Survey Online Community, […]

https://rbfirehose.com/2025/06/18/data-rescue-project-why-were-starting-a-new-federal-data-forum/

#PublicDomain #books #datasets #Harvard #AI

"The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data... To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006."

https://dash.harvard.edu/entities/publication/ca12cc2e-b726-4896-ba06-4d7f4c35cd3a

Institutional Books 1.0: A 242B Token Dataset from Harvard Library's Collections, Refined for Accuracy and Usability

Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.

Harvard Library: Institutional Books 1.0: A 242B Token Dataset from Harvard Library’s Collections, Refined for Accuracy and Usability. “The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear […]

https://rbfirehose.com/2025/06/13/institutional-books-1-0-a-242b-token-dataset-from-harvard-librarys-collections-refined-for-accuracy-and-usability-harvard-library/

Bidirectional Translations Between Observational and Topography-Based Hydrographic Data Sets: MERIT-Basins and the SWOT River Database (SWORD)

Shared paper: https://doi.org/10.1029/2024WR038633

#GIS #spatial #mapping #water #hydrology #model #modeling #SurfaceWater #Ocean #Topography #SWOT #SWORD #hydrography #stream #river #datasets #MERIT #basin #watershed #inventory #routing #network #discharge #flow #peakflow #remotesensing #satellite #earthobservation