Mastodawn

Waldo Jaquith Sep 5, 2024

Over a decade ago, I worked on a presidential papers project. The audacious goal was to scan in all presidential papers, make them available for download, and extract any possible data. But until the advent of the typewriter, virtually no data *could* be extracted, other than the odd letterhead. My proposal was to collect the images, build a processing pipeline, and when OCR of handwriting was possible, do it then.

Well, ChatGPT *nailed* this. So many handwritten documents can be discoverable!

Waldo Jaquith Sep 5, 2024

OCRing handwriting is a vastly more valuable use of LLMs than chatbots or image generation. I spent years of my career on OCRing big corpuses of text, and boy was it bad. I love the idea of a small LLM optimized for handwriting recognition. The National Archives and the Library of Congress both contain huge amounts of valuable information that's hard for humans to read and unsearchable (and I'm sure there are lots of other such collections). It's nice seeing a legitimately good LLM use case.

chx

@waldoj would this work for other writing systems/languages? Back in Hungary there are several archives' worth of various records few can read any more because the old handwriting was based on some German cursive and it's completely illegible except to a few scholars (one of them happens to be my brother).

Waldo Jaquith Sep 5, 2024

@chx Hypothetically, yes, but I assume that LLMs are heavily imbued with the biases of their creators, so e.g. lots of English-centrism. But there's no reason why an LLM couldn't be trained on those documents for which there are machine-readable translations.