Institutional Books: A 242B token dataset from Harvard Library's collections

https://arxiv.org/abs/2506.08300

#HackerNews #InstitutionalBooks #HarvardLibrary #TokenDataset #OpenData #DigitalCollections

Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.

arXiv.org
Article: “An Interface to View Collections of Visual Art” presents LadeCA.View—a visual tool to explore, describe, and analyze large image collections in the digital humanities.
https://link.springer.com/article/10.1007/s42803-022-00061-8
#DigitalHumanities #VisualCulture #DigitalArtHistory #InterfaceDesign #DigitalCollections #LadeCA #MuseumTech
An interface to view collections of visual art - International Journal of Digital Humanities

Art experts prefer being able to look at the individual images they are working on in the course of their research. However, if one were to look at digitally accessible images in the field of visual art, one would be dealing with billions of images; no one can handle visually examining such huge numbers of images one at a time. Therefore, art experts need special tools to examine and describe artworks in the context of other artworks. We used our experience from previous projects and interviews with members of the target group (art historians, curators, art dealers, and artists) to identify the central issues these experts encounter when working with large image collections and to determine the functionality and properties a system must offer to support their work. The results led to the customized interface LadeCA.View, which is now used in several projects. LadeCA.View enables experts to describe an exhibition or a collection of visual art in such a way that a user can obtain an overview of the intention, content, and structures of the exhibition or collection within a short period of time without looking at each image individually. LadeCA.View can also be used as an interface to probe more deeply into a collection or exhibition. In this paper we show the functions and visualizations of the interface and explain the design decisions. Furthermore, we outline LadeCA.View’s scope of applicability using three case studies.

SpringerLink

Another great opportunity to join us at the University of Glasgow, as we invest and expand our digital capacity in #digitallibraries, #digitalarchives, #DigitalHumanities and #digitalcollections!

This particular post is Digital Research Collections Coordinator, overseeing the management and development of curated digital collections. Potential for some interesting #data wrangling too! Details below.

Digital Research Collections Co-ordinator https://www.jobs.gla.ac.uk/job/digital-research-collections-co-ordinator #Glasgow

Digital Research Collections Co-ordinator · University of Glasgow

Job Purpose: To manage and maintain efficient, high-quality workflows, processes and procedures for the delivery of the Digital Research Collections service. To...

Last-minute alert! A group I'm in is hosting a talk on Connect to Collect, a network of museums and archives collecting social digital photography. Karolina Hedström of Stockholm County Museum and Bente Jensen of Aalborg City Archives will discuss community engagement, collecting about/during traumatic events, and preserving documentation.

Join us Wed. 5/14 at 4 pm BST/11 am EDT/8 am PDT on Zoom!

https://us06web.zoom.us/meeting/register/EwOAQQVoT9yWrz_kdXwpeQ#/registration

#archives #libraries #photography #preservation #digitalCollections

Welcome! You are invited to join a meeting: Connect to Collect: Preserving Social Digital Photography Through Collaboration and Community Engagement. After registering, you will receive a confirmation email about joining the meeting.

A talk by Karolina Hedström, Stockholm County Museum, Sweden, and Bente Jensen, Aalborg City Archives, Denmark. This talk will present Connect to Collect, a collaborative network of museums and archives dedicated to collecting and preserving social digital photography. Through the development of the innovative web app Connect to Collect, we aim to explore new methods for gathering and curating this vital form of contemporary heritage. Our current project emphasizes community engagement, employing citizen science approaches to address contemporary themes such as social and ecological sustainability, place, identity and sudden traumatic events (such as the 2017 Stockholm Terrorist Attack). Everyday digital photography, often shared on social media, represents a powerful documentation of modern society from the perspective of citizens and communities. Together, Nordic museums and archives involved in this initiative are pioneering participatory strategies to preserve this invaluable cultural heritage.

Zoom

Have never been to Florida but once upon a time FAU's Recorded Sound Archive Judaica collection was the top digitised Yiddish music archive, still in the top 5. I for one won't use it anymore and may be petty enough to remove old links to it from blog posts. Fuck 'em.
https://www.palmbeachpost.com/story/opinion/columns/2025/04/21/florida-atlantic-university-campus-police-ice-agents-immigration/83096056007/

#Florida #FAU #ICE #archives #DigitalCollections

FAU partners campus police with ICE, makes foreign students deportation targets | Opinion

Florida Atlantic University becomes first public university in Florida to partner campus police with immigration enforcement through ICE.

The Palm Beach Post

It's a New Year, so apply for a new job! Join us as a 'Digital Collections Systems Specialist' or 'Archivist (Digital Preservation)'. Deadline approaching for applications: **09 February**.

https://code4lib.social/@g3om4c/113861293985628356

#digipres #archives #DigitalArchives #DigitalCollections #DigitalLibraries #development #developer #jobs #Glasgow #Scotland

George Macgregor (@g3om4c@code4lib.social)

Sound the #jobs klaxon, folks!! 2025 brings new opportunities within Information Services at the University of #Glasgow! As part of our ongoing strategy to develop our digital offering and support emerging #digital initiatives, we are seeking candidates for the following two jobs: Digital Preservation Archivist https://www.jobs.gla.ac.uk/job/archivist-digital-preservation Digital Collections #Systems Specialist https://www.jobs.gla.ac.uk/job/digital-collections-systems-specialist Join us! Further info about relocating here: https://www.gla.ac.uk/myglasgow/pod/new/relocatingtoglasgow/ #digipres #DigitalArchives

code4lib.social

If you use digitized items (photos, manuscripts) or digitized historical audio/visual archives for your research, there is a new study out trying to understand how you find/use digitized, online library materials. The short, anonymous survey should take approximately 8-10 minutes to complete. https://forms.office.com/r/nuqD9ZJzFq

Send questions to Sarah Severson (sarah.severson@ualberta.ca) & Nailisa Tanner (nailisa.tanner@mcgill.ca) #digitalcollections #archives #research

Microsoft Forms

@nypl I love, love, LOVE the Digital Collections at NYPL!

Recently discovered this wonderful map of Great Kills Harbor and Crooke’s Point (then an island) in Staten Island.

#DigitalCollections #HistoricalMaps #GreatKills #StatenIsland #NYC

Digital Collections: Library of Congress Launches Digitized Collection of National AIDS Memorial Quilt Records

From the Library of Congress: The Library of Congress has released a groundbreaking online collection of the National AIDS Memorial Quilt Records, making one of the most poignant symbols of the AIDS epidemic in the United States available to a global audience. As the largest communal art project in the world, the AIDS Memorial Quilt […]

Library Journal infoDOCKET