Allen Riddell Jul 12, 2022

There's an extended version of "Reliable editions from unreliable components" on arXiv: https://arxiv.org/abs/2204.01638

Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models

A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.

Allen Riddell Jul 12, 2022

New article in JCDL '22: "Reliable editions from unreliable components: estimating ebooks from print editions using profile hidden Markov models." https://doi.org/10.1145/3529372.3533292 #jcdl

Allen Riddell Sep 7, 2020

New preprint with Troy Bassett, "What Library Digitization Leaves Out: Predicting the Availability of Digital Surrogates of English Novels" https://arxiv.org/abs/2009.00513

What Library Digitization Leaves Out: Predicting the Availability of Digital Surrogates of English Novels

Library digitization has made more than a hundred thousand 19th-century English-language books available to the public. Do the books which have been digitized reflect the population of published books? An affirmative answer would allow book and literary historians to use holdings of major digital libraries as proxies for the population of published works, sparing them the labor of collecting a representative sample. We address this question by taking advantage of exhaustive bibliographies of novels published for the first time in the British Isles in 1836 and 1838, identifying which of these novels have at least one digital surrogate in the Internet Archive, HathiTrust, Google Books, and the British Library. We find that digital surrogate availability is not random. Certain kinds of novels, notably novels written by men and novels published in multivolume format, have digital surrogates available at distinctly higher rates than other kinds of novels. As the processes leading to this outcome are unlikely to be isolated to the novel and the late 1830s, these findings suggest that similar patterns will likely be observed during adjacent decades and in other genres of publishing (e.g., non-fiction).

Allen Riddell Sep 7, 2020

New link blog, Beyond Seven Review: https://www.beyondseven.org/. h/t Mark Sample.

Allen Riddell Oct 18, 2019

New paper with Troy Bassett, "The Class of 1838: A Social History of the First Victorian Novelists" https://osf.io/9p3tc/ #bookhistory #raymondwilliams #≥1789