Over a decade ago, I worked on a presidential papers project. The audacious goal was to scan in all presidential papers, make them available for download, and extract any possible data. But until the advent of the typewriter, virtually no data *could* be extracted, other than the odd letterhead. My proposal was to collect the images, build a processing pipeline, and when OCR of handwriting was possible, do it then.

Well, ChatGPT *nailed* this. So many handwritten documents can now become discoverable!

OCRing handwriting is a vastly more valuable use of LLMs than chatbots or image generation. I spent years of my career on OCRing big corpuses of text, and boy was it bad. I love the idea of a small LLM optimized for handwriting recognition. The National Archives and the Library of Congress both contain huge amounts of valuable information that’s hard to read for humans and unsearchable (and I'm sure there are lots of other such collections). It's nice seeing a legitimately good LLM use case.
@waldoj this would be super useful for my historical society archives. we have an enormous amount of handwritten ledgers, property records, etc. no page is valuable enough to be worth hand transcribing but a good-enough OCR pass would unlock a lot.
@nelson Oh, that's a great point—I bet that's true of every historical society, nearly all of which have vanishingly few resources for this kind of thing.
@waldoj I've heard rumors the Ancestry folks (and related LDS genealogy groups) have very good software for doing this for census records, etc. I haven't looked into it. They also have done a lot of manual transcription.
@waldoj @nelson not for handwriting, but it shows how far you can get as an individual with custom models: https://github.com/VikParuchuri/surya
GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, table recognition in 90+ languages

@nelson IME, processing handwritten text in forms is a lot easier than unstructured text, because most fields have a constrained set of possible values.
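That constraint is straightforward to exploit in post-processing. A minimal sketch (the field names and allowed values here are hypothetical, not from any real project), using Python's standard difflib to snap a noisy OCR reading to the closest permitted value:

```python
import difflib

# Hypothetical form schema: each field has a constrained set of values.
ALLOWED = {
    "state": ["Virginia", "Maryland", "Pennsylvania"],
    "marital_status": ["Single", "Married", "Widowed"],
}

def snap_to_allowed(field, ocr_value, cutoff=0.6):
    """Return the allowed value closest to the OCR reading, or None if
    nothing is similar enough (flagging the field for human review)."""
    matches = difflib.get_close_matches(ocr_value, ALLOWED[field], n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(snap_to_allowed("state", "Virgniia"))          # "Virginia"
print(snap_to_allowed("marital_status", "Marri3d"))  # "Married"
print(snap_to_allowed("state", "~~~~~"))             # None (needs review)
```

Fields where no allowed value clears the similarity cutoff fall through to a human, which is usually a tiny fraction of a structured form.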
@waldoj that and adding alt text to images automatically for the vision impaired, etc

@codinghorror @waldoj I've seen a lot of AI-captioned images and it ranges from useless to misleading.

And then you get horrifically wrong interpretations like this one: https://mastodon.social/@philipncohen/113080493757208395 (see post text for AI output -- it's not in the preview)

@timmc @codinghorror I’d love to hear from some blind and visually-impaired folks whether they find that they prefer AI-generated alt text to no alt text. (Yes, far better is the option of human-written alt text, but it’s been 24 years and we haven’t made much progress on that front.)

I hope somebody is conducting a study on this!

@waldoj @codinghorror I'd love to see a study as well.

I've seen some reports on the matter from blind folks, but it was long enough ago that I can't recall the details. However, I do know that AI generated captions for videos are widely despised by the visually impaired—maybe better than nothing, maybe not—and I feel like those captions are generally in better shape than the generated alt text I've seen.

@timmc @codinghorror @waldoj Misgendering alt-text in emails is an insidious one that we've been seeing in the wild. The author could have inspected and approved the generated text; if they had (or had written the caption themselves), it would be harassment, but they also have plausible deniability.

The sender isn't supplying the image-to-text tool with any extra information, so recipients could run their own tool and get results of the same quality (plus the ability to consult multiple tools or upgraded versions).

@waldoj Someone on my crew has been working to build out an LLM/ML-driven OCR-to-LaTeX pipeline, and it’s (potentially) such a useful thing to have around. SO MANY LLM uses are just trash (any flavor trash you like!) and it’s nice to find one that might actually do some good, you know?
@nothingfuture @waldoj Also a great project because you can hook it up to reinforcement learning, as well as running the LaTeX before returning an answer, giving the model a couple more tries to get it right.
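A rough sketch of that verify-and-retry loop, with everything hedged: ocr_model is a stand-in for whatever model produces the LaTeX, the retry count is arbitrary, and the compile check assumes pdflatex is on PATH.

```python
import pathlib
import subprocess
import tempfile

def latex_compiles(source):
    """Wrap a LaTeX fragment in a minimal document and try to compile it.
    Assumes pdflatex is installed and on PATH."""
    doc = "\\documentclass{article}\n\\begin{document}\n" + source + "\n\\end{document}\n"
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "check.tex").write_text(doc)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "check.tex"],
            cwd=tmp,
            capture_output=True,
        )
        return result.returncode == 0

def ocr_with_retries(image, ocr_model, max_tries=3, check=None):
    """Ask the (hypothetical) OCR model for a transcription up to max_tries
    times, returning the first candidate that passes the compile check."""
    check = latex_compiles if check is None else check
    candidate = None
    for _ in range(max_tries):
        candidate = ocr_model(image)  # assumed to return a LaTeX string
        if check(candidate):
            return candidate
    return candidate  # fall back to the last attempt for human review
```

The same loop is also the reinforcement-learning hook: whether the output compiles is a free, automatic reward signal.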
@waldoj llms have a long runway - which i think is why they iterate so often - but specifically, models will get more specialized, and then people can combine all the specialized models into one giant model. also, hardware is going to get so much better that we can expect, dare i say it, even more growth - not just in big tech and enterprise, but expanding out to smb and prosumer

@waldoj @anildash my office helped facilitate, through tens of thousands of cloud credits, the OCRing of issues of old newspapers and it warms my heart to know how much inaccessible data is now catalogued because of it.

Just a stupendously better use of resources

https://dell-research-harvard.github.io/projects/393as

American Stories

Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring. “American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers.”

@waldoj oh man, when I was reading Chernow's bio of Washington I went looking for his diaries to track down some fact that he cited and *I* could barely read his handwriting.
@waldoj would this work for other writing systems/languages? Back in Hungary there are several archives' worth of various records few can read any more, because the old handwriting was based on a German cursive and it's completely illegible except to a few scholars (one of them happens to be my brother).
@chx Hypothetically, yes, but I assume that LLMs are heavily imbued with the biases of their creators, so e.g. lots of English-centrism. But there's no reason why an LLM couldn't be trained on those documents for which there are machine-readable translations.
@waldoj In a similar vein, a use of LLMs I actually found reasonable is “translating” old texts into modern language. I don’t know how severe this is with old English, but at least for me, old Dutch is distractingly difficult to interpret because of the changes in the language over a few centuries. An LLM can quickly turn that into an easy-to-read modern version.

@waldoj @stralau but that would be using Machine Learning (ML), not generative AI via a Large Language Model (LLM)?

it'd even make things worse, as it would just hallucinate whatever it couldn't detect correctly, replacing text that's inaccessible with text that's potentially inaccurate or even plain wrong.

e.g. https://www.crikey.com.au/2024/09/03/ai-worse-summarising-information-humans-government-trial/

AI worse than humans in every way at summarising information, government trial finds

A test of AI for Australia's corporate regulator found that the technology might actually make more work for people, not less.

@count @stralau OCR already hallucinates stuff it can’t detect properly—this is no different. I’m aware of no ML breakthroughs in handwriting OCR, but LLMs sure seem to be good at it.
@waldoj @stralau I'm not sure we share the same understanding of ChatGPT architecture ☺️
@waldoj Was this the plain ChatGPT available to anyone, or was this a specialized sub-model?
I'm waiting for this as well, as it could revolutionize genealogy. All written records suddenly searchable - what a treasure!
@fordprefect This was ChatGPT 4o, not a special model.

@waldoj yup, we’re developing ocr4all (https://www.ocr4all.org) at my workplace for that, and are using it with good success on stuff like medieval manuscripts and the like. as i understand it, handwritten text requires a couple of pages of transcription for training a bespoke model, which then OCRs the rest of the document.

as uses of ‘AI’ go, it’s my go-to example for genuinely helpful ones.

@gekitsu @waldoj Everything I see on ocr4all.org is about printed text, so the work on manuscripts is an off-label or experimental deployment? I think a project called Transkribus has done the most on this, but as you said, the training requirement is crucial, and that ties the model to a specific scribe, which is no good for whole files or collections. Anyway I would love to read more about any HTR that is happening in your organization.

@aarbrk @waldoj i wouldn’t say off-label, since the very first sentence of the about page (‘what is ocr4all’) reads:

OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material.

but yeah, you are correct both in transkribus being the big player in the field, as it were, and of the specificity problem. i’d have to ask whether it’s feasible to make a more general use model that is trained on multiple hands in the same script. (and i’m currently on vacation, so i’d have to get back to you on that.)

the aspects of HTR i’m familiar with are from the perspective of someone liaising with the domain experts who are partnering with us for digital edition projects, so it’s mostly at a bit of a distance. but from the project i’m currently assigned to, we’ve HTR-ed two medieval dutch manuscripts (14th century verse and 15th century prose) and that seemed to go over well with the philologist who is now working on the texts.

@gekitsu @waldoj Thank you for the correction, I did miss that.

I have an article under peer review right now about digitizing a particular collection of 19th century manuscripts. I wrote it a couple of years ago and am imagining that the reviewers will come back with requests to elaborate on new software solutions. In that case, about 12 professional scribes collaborated to write about 23,000 pages of text. With another researcher, we have experiments underway to detect the hand shifts, which is possible.

@gekitsu @waldoj If your organization ever wants more collaborators, let me know. I am particularly interested in HTR as inputs for double-keying texts to create scholarly-quality editions (i.e., with a human also vetting, which could be what you described above.)
@waldoj I am not sure how helpful this is, but this seems to be at least in the same (probably very big) ballpark.
https://hcommons.social/@benwbrum/113034269440259299
Ben W. Brumfield (@[email protected])

I'm pleased to announce that Sara and I are part of a team that have been awarded an NEH Digital Humanities Advancement Grant. The team (led by Lindsey Peterson of University of South Dakota and the Civil War and Reconstruction Governors of Mississippi documentary edition and Elisabeth La Beaud of the Mississippi Digital Library) will be developing "AI-based software that uses named entity recognition and large language models to automatically create subject tags for digitized cultural heritage materials to enhance searches and usability." In short, we're trying to make the emerging tools for recognizing and identifying entities within text actually usable for librarians, archivists, and scholarly editors. Here is the full list of awards: https://www.neh.gov/sites/default/files/2024-08/NEH%20August%202024%20grants%20list%20state%20by%20state.pdf

@ManniCalavera @benwbrum That’s so great! Yay, Ben!

@waldoj @ManniCalavera

Thanks! We're doing a lot with HTR, and I think that LLMs definitely play a role. However, we need new metrics to evaluate LLM outputs, since traditional OCR/HTR produces obvious errors, while an LLM's errors are seductively plausible.

Ben W. Brumfield (@[email protected])

I'm reading a lot about #ChatGPT4o and #htr, so I figured I'd spot-check it with this document: https://fromthepage.com/lva/va-revcon-74-76/work-7811594-024/display/33996304 The results were better than nothing, but still not great. In particular, it was not able to read Lord Dunmore's name, nor read either of two references to "Emancipate our Slaves". If you've been following historical debates about the #1619project , you can imagine how problematic it would be to rely on this transcription for full-text search.

@benwbrum @waldoj @ManniCalavera That's a good point! I was wondering how LLMs would compare to dedicated HTR models. I can see the benefit of not having to train models, but iirc, #Transkribus also released a universal model? And I do see the danger of over-correction/hallucinating. I know some transcription projects let non-native speakers transcribe their texts so that they wouldn't unconsciously fix mistakes in the original. I guess LLMs would be very prone to this.
@benwbrum @waldoj @ManniCalavera We quite clearly see this in audio transcription. #Whisper does hallucinate sometimes, but even if it doesn't, it produces a very much smoothed interpretation of what was said. That's undesired e.g. for social science research where pauses, interruptions, repetitions etc. are important for later analysis.
@benwbrum @waldoj @ManniCalavera But of course requirements differ, and if your focus is on document retrieval, it might still be better than nothing.

@felwert @waldoj @ManniCalavera

I think you're right that we have to consider fitness for purpose when choosing engines, and I suspect that we will end up with multiple derivative transcriptions for different uses.

For example, researchers using screen readers are benefited by over-correction, especially modernized punctuation and orthography. That's the opposite of scholarly best practice for diplomatics.

By contrast, one of our use-cases for raw HTR is determining the language of a document. Even #transkribus will hallucinate nonsense phrases in French or German when passed an image of a page that is blank except for ink bleed-through. For most purposes, this is no different from the gibberish punctuation produced by OCR, but for language determination it's a problem!

@waldoj My #AppleNewton #MessagePad still uses #AI handwriting recognition. While the earlier models had a bad rap (#EggFreckles), the #NewtonOS 2.1 models were quite good.
@morgant I loved my Newton. It was what made me an Apple convert.
@waldoj Wow, that is not an answer I was expecting and a much appreciated surprise!

@waldoj I'm rather amazed to discover that even my iPhone can read some of that. Not great. But way better than I would have expected.

May, 10C€ 1852
My Dean Mif talerton,
have your not of datado
sohaiting the appointment of billian Midoud, a young artist li some consulate in Pal, that shall enable him to pay his way and purene his etudies, and regret to vary that there is se such place, which is vacant. the consulale at home is filled, and those at Laghorn and Florence gild only from 200 lo 50 anyear, and are whelly in. adequate for the purpores which you daine.
Same goin clisert
Millar Kenno
AC. 1443

@nazgul I tested the same thing last night, and my result was about the same as yours—not great, but vastly better than I guessed.

@nazgul @waldoj Washington ly May, 100% 1852

My Dear Miss Waterston,

Shave your note of Satunday sobating the appointinent of William M.Sond, gouung artist lo some consulate in Italy, that shall enable him to pay his way and pursue his studies, and. such place which regret to say that there is no is vacant. The consulate at home filled, and those at Leghorn and Florence yield only from 200 to 250- a year, and are wholly in.adequate for the purposes which you desire. Saw your oft sert

AC. 1443

@nazgul @waldoj Google Lens (it seems to have skipped the signature)

I had to delete some newlines between “in.” and “adequate” to make it fit.

@drgroftehauge @waldoj This is actually of more than abstract interest to me. I have a journal my father kept of the first year he was dating my mother. It's handwritten, but neat and legible (much better than that image). It's very cool to think I might be able to easily transcribe that now.
@waldoj slightly OT: would one still write Washington City? I assume it was to distinguish it either from the state/territory or from the former president? According to Wikipedia, even the territory was only formed one year after this letter was written.
@mfriedenhagen Interesting, I hadn’t even noticed that! No, nobody would write “Washington City” today—I’ve never seen that before.
@waldoj I am archiving thousands of letters currently. Claude Sonnet 3.5 is astounding at reading handwriting. It can often read words I cannot. It’s very close to perfect, even on the letters I am working with, which date from WW2.