https://doi.org/10.5860/ital.v44i4.17404
#ResearchData #DigitalCollections #libraries
Southern Methodist University: From rails to revolutions: New windows into the past in digital collections. “What do a sugar railway in Cuba, U.S. soldiers hunting Pancho Villa, and a priest blessing a taxi in Mexico have in common? They’re all part of a fascinating array of newly digitized materials now available in SMU’s Digital Collections.”
🎉Thrilled to share the publication of our #OpenAccess book 'Opening up our Heritage: Opportunities in Digitising and Promoting Cultural and Research #Collections', a collection of 19 chapters written by librarians and researchers on the #digitisation and promotion of cultural and scientific #heritage.
👉 HTML: https://e-publish.uliege.be/opening-up-our-heritage
👉 PDF and ePub: https://e-publish.uliege.be/opening-up-our-heritage/front-matter/free-download-buy/
#DigitalArchives #DigitalCollections #Preservation #Libraries #Metadata #Discoverability #OpenScience #Pressbooks
The CDNC includes content from hundreds of newspapers that have been published throughout the state, going back as far as 1846. As of this writing, there are 23,449,221 pages in the CDNC archive, but the staff that managed the project was terminated.
Institutional Books: A 242B token dataset from Harvard Library's collections
https://arxiv.org/abs/2506.08300
#HackerNews #InstitutionalBooks #HarvardLibrary #TokenDataset #OpenData #DigitalCollections
Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.
Did you know the Share button in Canopy encodes links in Content State? You can use the encoded parameter value to reopen your state in another viewer like Clover or Theseus to jump right back to that exact view in Canopy. Handy for collaboration and citation! #IIIF #DigitalCollections #Canopy
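For anyone curious what an encoded Content State value looks like, here is a minimal Python sketch of the encoding described in the IIIF Content State API 1.0: URI-encode the state, base64url-encode the result, then strip the padding. The manifest, canvas, and viewer URLs are hypothetical placeholders, and whether Canopy, Clover, or Theseus follow exactly these steps and accept the `iiif-content` query parameter is an assumption, not something confirmed in the post.

```python
import base64
from urllib.parse import quote, unquote, urlencode


def encode_content_state(plain: str) -> str:
    """URI-encode, then base64url-encode, then drop the '=' padding."""
    uri_encoded = quote(plain, safe="~()*!'")
    b64url = base64.urlsafe_b64encode(uri_encoded.encode("utf-8")).decode("ascii")
    return b64url.rstrip("=")


def decode_content_state(encoded: str) -> str:
    """Restore the padding, base64url-decode, then URI-decode."""
    padded = encoded + "=" * (-len(encoded) % 4)
    uri_encoded = base64.urlsafe_b64decode(padded).decode("utf-8")
    return unquote(uri_encoded)


# Hypothetical content state: a region of one canvas inside one manifest.
state = (
    '{"id":"https://example.org/manifest/1/canvas/2#xywh=100,100,300,300",'
    '"type":"Canvas",'
    '"partOf":[{"id":"https://example.org/manifest/1","type":"Manifest"}]}'
)

encoded = encode_content_state(state)

# Hypothetical viewer URL; the parameter name is assumed to be `iiif-content`.
link = "https://viewer.example.org/?" + urlencode({"iiif-content": encoded})
print(link)
```

Pasting the encoded value into `decode_content_state` returns the original JSON, which is a quick way to check what a shared link actually points at.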
Art experts prefer being able to look at the individual images they are working on in the course of their research. However, if one were to look at digitally accessible images in the field of visual art, one would be dealing with billions of images; no one can handle visually examining such huge numbers of images one at a time. Therefore, art experts need special tools to examine and describe artworks in the context of other artworks. We used our experience from previous projects and interviews with members of the target group (art historians, curators, art dealers, and artists) to identify the central issues these experts encounter when working with large image collections and to determine the functionality and properties a system must offer to support their work. The results led to the customized interface LadeCA.View, which is now used in several projects. LadeCA.View enables experts to describe an exhibition or a collection of visual art in such a way that a user can obtain an overview of the intention, content, and structures of the exhibition or collection within a short period of time without looking at each image individually. LadeCA.View can also be used as an interface to probe more deeply into a collection or exhibition. In this paper we show the functions and visualizations of the interface and explain the design decisions. Furthermore, we outline LadeCA.View’s scope of applicability using three case studies.
Another great opportunity to join us at the University of Glasgow, as we invest in and expand our digital capacity in #digitallibraries, #digitalarchives, #DigitalHumanities and #digitalcollections!
This particular post is Digital Research Collections Coordinator, overseeing the management and development of curated digital collections. Potential for some interesting #data wrangling too! Details below.
Digital Research Collections Co-ordinator https://www.jobs.gla.ac.uk/job/digital-research-collections-co-ordinator #Glasgow