There's an extended version of the "Reliable editions from unreliable components" paper on arXiv:
https://arxiv.org/abs/2204.01638
Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models
A profile hidden Markov model, a popular model in biological sequence
analysis, can be used to model related sequences of characters transcribed from
books, magazines, and other printed materials. This paper documents one
application of a profile HMM: automatically producing an ebook edition from
distinct print editions. The resulting ebook has virtually all the desired
properties found in a publisher-prepared ebook, including accurate
transcription and an absence of print artifacts such as end-of-line hyphenation
and running headers. The technique, which has particular benefits for readers
and libraries that require books in an accessible format, is demonstrated using
seven copies of a nineteenth-century novel.
New article in JCDL '22: "Reliable editions from unreliable components: estimating ebooks from print editions using profile hidden Markov models."
https://doi.org/10.1145/3529372.3533292 #jcdl
New preprint with Troy Bassett, "What Library Digitization Leaves Out: Predicting the Availability of Digital Surrogates of English Novels"
https://arxiv.org/abs/2009.00513
What Library Digitization Leaves Out: Predicting the Availability of Digital Surrogates of English Novels
Library digitization has made more than a hundred thousand 19th-century
English-language books available to the public. Do the books which have been
digitized reflect the population of published books? An affirmative answer
would allow book and literary historians to use holdings of major digital
libraries as proxies for the population of published works, sparing them the
labor of collecting a representative sample. We address this question by taking
advantage of exhaustive bibliographies of novels published for the first time
in the British Isles in 1836 and 1838, identifying which of these novels have
at least one digital surrogate in the Internet Archive, HathiTrust, Google
Books, and the British Library. We find that digital surrogate availability is
not random. Certain kinds of novels, notably novels written by men and novels
published in multivolume format, have digital surrogates available at
distinctly higher rates than other kinds of novels. As the processes leading to
this outcome are unlikely to be isolated to the novel and the late 1830s, these
findings suggest that similar patterns will likely be observed during adjacent
decades and in other genres of publishing (e.g., non-fiction).
New link blog, Beyond Seven Review:
https://www.beyondseven.org/. h/t Mark Sample.
New paper with Troy Bassett, "The Class of 1838: A Social History of the First Victorian Novelists"
https://osf.io/9p3tc/ #bookhistory #raymondwilliams #≥1789