Mastodawn

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN)

The code is a little untidy but at 226 lines it honestly doesn't matter https://github.com/simonw/tools/blob/9fb049424f4ec8f8ffb91a59ab7111cad56088fc/ocr.html

Also neat is that the enabling libraries here - Tesseract.js and PDF.js - are both pretty old at this point:

First commit to Tesseract.js was Jun 26, 2015 https://github.com/naptha/tesseract.js/commit/906ce3cadbffaf5f7317a4418f282c4b78bf8385

First to PDF.js was Apr 25, 2011 https://github.com/mozilla/pdf.js/commit/6dc1770bba7a417ce5664c0305469e5bb7ea76bd

Pure Javascript OCR for more than 100 Languages 📖🎉🖥 - init · naptha/tesseract.js@906ce3c

https://github.com/simonw/textract-cli

My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
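
For anyone scripting around those limits, here is a minimal Python sketch of a pre-flight check. The 5MB figure and the JPEG/PNG restriction come from the post above. `sync_eligible` and `extract_lines` are hypothetical names, reading 5MB as 5 * 1024 * 1024 bytes is an assumption, and this is not how textract-cli itself is implemented. `extract_lines` additionally assumes boto3 is installed and AWS credentials are configured.

```python
from pathlib import Path

# Limits as described in the post above. Treat them as assumptions rather
# than an authoritative statement of Textract's current quotas.
MAX_SYNC_BYTES = 5 * 1024 * 1024
SYNC_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sync_eligible(filename: str, size_bytes: int) -> bool:
    """Return True if a file fits the synchronous limits described above."""
    suffix = Path(filename).suffix.lower()
    return suffix in SYNC_SUFFIXES and size_bytes <= MAX_SYNC_BYTES


def extract_lines(path: str) -> list[str]:
    """Hypothetical helper: send one image to Textract, return its text lines."""
    import boto3  # imported lazily so sync_eligible() works without boto3

    data = Path(path).read_bytes()
    if not sync_eligible(path, len(data)):
        raise ValueError(f"{path} must be a JPEG or PNG of at most 5MB")
    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": data})
    # Keep only LINE blocks; WORD and PAGE blocks are also returned.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
```

Anything that fails `sync_eligible` would need the S3 route mentioned above, which this sketch does not attempt.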

GitHub - simonw/textract-cli: CLI for running files through AWS Textract

https://tools.simonwillison.net/ocr

New feature for my browser-based OCR tool: you can now select the Tesseract.js language to use, from a list of 102 options

OCR PDFs and images directly in your browser

One tiny extra detail which possibly only I care about: changing the selection in the language select now updates a ?language=x query string, so you can bookmark a language and the back/forward buttons navigate through that selected state

Here's OCR for Welsh, bookmarked: https://tools.simonwillison.net/ocr?language=cym
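
The tool itself does this in browser JavaScript. As a rough illustration of the same round trip, here is a Python sketch that writes and reads a `language` query parameter. The helper names and the `eng` fallback are assumptions for illustration, not taken from the tool's source.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_language(url: str, language: str) -> str:
    """Return url with its language query parameter set to the given code."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["language"] = [language]  # replaces any existing value
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def language_from(url: str, default: str = "eng") -> str:
    """Read the language code back out of a bookmarked URL."""
    values = parse_qs(urlsplit(url).query).get("language")
    return values[0] if values else default
```

Because the selected language lives in the URL, a bookmark or a back/forward navigation carries everything needed to restore the selection.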

https://tools.simonwillison.net/ocr?language=frm

Anyone got any documents lying around in Middle French, circa 1400-1600?

Apparently Tesseract / Tesseract.js can handle them, so I'd love to see my tool try!

... definitely going to stop tinkering with this thing now, but I did add a few basic automated tests just now using Playwright Python https://github.com/simonw/tools/blob/main/tests/test_ocr.py - and a tiny bit of assistance from Claude 3 Opus https://github.com/simonw/tools/issues/8#issuecomment-2029152772

Joel Kin Apr 1, 2024

@simon that’s the claim this band makes, though I can’t say I have the wherewithal to verify it! https://genius.com/albums/Vehemence-fr/Ordalies

Joel Kin Apr 1, 2024

@simon I do have a pdf of a French play from around 1630 if you’d like

@frueheneuzeit Apr 1, 2024

@simon you should get in touch with this guy: https://huggingface.co/Pclanglais (Alexander Doria on Twitter and Bluesky).

Pclanglais (Pierre-Carl Langlais)

Tintin & Mickey!

Johan Richer Apr 1, 2024

@simon https://gallica.bnf.fr/ark:/12148/btv1b86000209/f14.item

Gargantua. La Vie inestimable du grand Gargantua, père de Pantagruel , jadis composée par l'abstracteur de quinte essence. Livre plein de pantagruélisme -- 1535 -- livre

Gallica

: j@fabrica:~/src; Apr 1, 2024

@simon Now I’m wondering about Linear A

Mia Apr 1, 2024

@simon https://manuscrits-france-angleterre.org/polonsky/en/content/accueil-en (but only the BnF ones display at the moment because of the BL situation)

France-England: medieval manuscripts between 700 and 1200

rd_palmer Apr 1, 2024

@simon perhaps there's something of use in Alix Chagué and Thibault Clérice's project to provide a catalogue of training datasets: https://htr-united.github.io/catalog.html

HTR-United

HTR-United is a catalog and an ecosystem for sharing and finding ground truth for optical character or handwritten text recognition (OCR/HTR).

alexwlchan Apr 1, 2024

@simon 1470, French, seems like a good bet: https://wellcomecollection.org/works/eu3ym7su/items?canvas=8

Livre des simples médecines, in French

Wellcome Collection

https://aws-samples.github.io/amazon-textract-textractor/commandline.html

Sym Roe Mar 31, 2024

@simon The (oddly hard to find) Textractor python library does this nicely, with async interface too:

> pip[x] install amazon-textract-textractor
> textractor detect-document-text your_file.png output.json

But maybe it's processing the output into something useful that you needed? Parsing their JSON can be tricky, but that library also has a Document class with handy `to_markdown` or `to_pandas` methods

CLI — amazon-textract-textractor 1.0.0 documentation

@symroe well that would have saved me a bit of time! Thanks for the link, I'll add that to the textract-cli README

Sym Roe Mar 31, 2024

@simon I also built about 70% of a DIY solution before finding it! 🤷‍♂️

Julia Mar 30, 2024

@simon Tesseract (the non-JS version) was originally created by HP in the 1980s and open-sourced in 2005.

Brandon Biggs Mar 30, 2024

@simon this is super cool!

Molly White Mar 30, 2024

@simon nice! i was thinking of trying to do something similar to autogenerate alt text, which i currently tend to do by opening images in chrome and using google lens (far too many clicks)

@molly0xfff Yes! I first used something like this for the alt text in my annotated presentation tool here: https://til.simonwillison.net/tools/annotated-presentations

Annotated presentation creator

James Young Mar 31, 2024

@molly0xfff @simon in case you didn't know: both Mastodon's web version and Ivory offer OCR for uploaded images

aaron schaffer Mar 30, 2024

@simon Very cool. Though I get a Heroku error when I try to go to your site ("Application error: An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details. You can do this from the Heroku CLI with the command heroku logs --tail")

@aaronjschaffer Huh... it looks like it's the Mastodon effect, where sending out a link causes thousands of Mastodon servers to all hit /.well-known/webfinger?resource=acct:simon@simonwillison.net at the same time - but I've survived these storms just fine in the past, not sure why it's hurting the site today

aaron schaffer Mar 30, 2024

@simon Ah gotcha! I love a little suspense, I'll check again later!

@aaronjschaffer Worked through it here, should be working OK again now https://github.com/simonw/simonwillisonblog/issues/415

Get Cloudflare to cache /.well-known/webfinger · Issue #415 · simonw/simonwillisonblog

I'm getting a huge flurry of hits to this URL right now because I tooted a link: https://simonwillison.net/.well-known/webfinger?resource=acct:simon@simonwillison.net It seems to have made my site ...

Nelson Minar Mar 30, 2024

@simon under 8MB! 3MB each for Tesseract WASM and the training data.

Prem Kumar Aparanji 👶🤖🐘 Mar 30, 2024

@simon need more such browser-only, offline-first, privacy-first apps that don't require any installation or configuration!

https://bne.social/@simon/112057608292224084

Stuart Gray Mar 30, 2024

@prem_k @simon If you didn't see it at the time, this was quite a cool offline browser-based transcription tool posted a few weeks back:

Like you, I love these kinds of tools but if I could *beg* the authors for one feature - please make it easy to download the needed files so I can run it all truly offline :)

Simon Elvery (@simon@bne.social)

I wrote about scratching my own itch and building a transcription tool. It's completely private, neither the audio or the transcript ever leaves your browser. If this is the kind of tool you use, I'd love to hear your feedback (both on the write-up and the tool). https://elvery.net/drzax/cobbling-together-a-private-machine-transcription-and-editing-tool/

bne.social

@StuartGray @prem_k That really is a worthwhile feature for this one, I've opened an issue - no promises I'll solve it though, there are things in there relating to bundling that I don't know how to do yet https://github.com/simonw/tools/issues/2

Version of OCR that can run entirely offline · Issue #2 · simonw/tools

Currently https://tools.simonwillison.net/ocr loads assets from a CDN. A version that can run offline would be fantastic. It would be a tiny bit tricky to get versions of PDF.js and Tesseract.js (a...

gabi Mar 30, 2024

@simon Insanely cool! It works fine in Android Chrome (no luck with Firefox though).

Alexander Winkler Mar 30, 2024

@simon not sure if this is of interest to @jbaiter , cf. https://openbiblio.social/@jbaiter/110815957206638047

Johannes Baiter (@jbaiter@openbiblio.social)

Attached: 1 video 👀 #demotime Ever wish you could search through a #IIIF manifest, but the provider had neither #OCR, nor a #ContentSearch endpoint available? 🪄 You can soon help yourself: Fully client-side OCR and Content Search + Autocomplete for Mirador 3. And it survives page reloads! ✨

OpenBiblio.Social

Jage Mar 30, 2024

@simon I love reading about your process. It's been so fun to create small applets using AI with a bit of human assist. Do you think the Tesseract OCR has improved over the past few years? I remember it being quite sloppy back in the day.

@Jage it definitely has - they moved to a fancy LSTM neural network based thing within the last 5 years I think

Jage Mar 30, 2024

@simon Very cool. Thanks for sharing.

greatblueheron Mar 30, 2024

@simon This is very cool. I’ve been wondering for a while if there’s a similar tool for handwritten text. One of my family members left behind journals that are very difficult to read, though I can make out letters here and there. It seems like it should be possible to make some sort of trainable, handwriting reader, but I don’t know if anyone’s done that.

@greatblueheron AWS Textract works really well with handwriting in my experience

circafuturum Mar 30, 2024

@simon I really like this! Really nice to read the JS, HTML, CSS all in one file too 👏 awesome! I find projects putting models in browsers super interesting right now!

8thcross Mar 30, 2024

@simon interested to hear more about your analysis on "very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently—I’ll write more about that soon. But those tools are still inconvenient for most people to use."

@8thcross this is the issue to watch https://github.com/simonw/llm/issues/331

Multi-modal support for vision models such as GPT-4 vision · Issue #331 · simonw/llm

https://platform.openai.com/docs/guides/vision I think this is best handled by command line options --image and --image-urls to either encode and pass as base64, or to pass a URL.
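
A minimal Python sketch of the two options that issue describes: pass a URL through untouched, or read a local file and encode it as base64. The function names and the choice of a data URL as the encoded form are assumptions for illustration. This is not the `llm` tool's code.

```python
import base64
import mimetypes


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_reference(source: str) -> str:
    """Return a remote URL unchanged, or a data URL for a local file path."""
    if source.startswith(("http://", "https://")):
        return source
    # Guess the MIME type from the file extension, with a generic fallback.
    mime_type, _ = mimetypes.guess_type(source)
    with open(source, "rb") as f:
        return to_data_url(f.read(), mime_type or "application/octet-stream")
```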

Kaveinthran (no longer here) Mar 31, 2024

@simon Hi are you also looking at upgrading the OCR engines, probably using VikParuchuri/surya https://github.com/VikParuchuri/surya

GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, table recognition in 90+ languages

Kaveinthran (no longer here) Mar 31, 2024

@simon Also can add Surya layout https://x.com/vikparuchuri/status/1772700744673583424?s=46&t=LkACU5SURZ83u1uLZ5iBYw

Vik Paruchuri (@VikParuchuri) on X

Announcing surya layout! It detects tables, images, figures, section headers, and more. It works with any language, and a variety of document types. Find it here - https://t.co/DD2HfwI8jK . Thanks @LambdaAPI for sponsoring compute.

X (formerly Twitter)

@kaveinthran for this particular project I'm only looking at libraries I can run in a web browser, but thanks for the tip - I hadn't seen that one before

Mark Crocker Mar 31, 2024

@simon how does it do with data in boxes? Most of what I have to do OCR on are forms, and most of the data is in boxes. OCR that reads whole lines, or even columns, has problems with this.

@mcrocker pretty badly - Tesseract isn't the best tool for that. I would expect AWS Textract or maybe Claude 3 Opus or Gemini Pro 1.5 to handle those better, though they still aren't completely error free in my experience

Mandar Vaze (desipenguin) Mar 31, 2024

@simon Simon (and/or anyone in this thread) :

Is there a good tool/library for extracting text from handwritten notes (converted to an image via photo)?
Tesseract doesn't work well.
I tried Google Lens with better results, but that means I need to upload the image to their server.

#ocr for #handwritten text

Mandar Vaze (desipenguin) Apr 1, 2024

@simon FWIW, I ran OCR on an image from https://hamel.dev/blog/posts/evals/ - the post you shared earlier.
I'm pretty sure it was created with Excalidraw (or a similar tool), so the text is rendered in a font.
But the OCR was 50% correct at best.

Your AI Product Needs Evals

How to construct domain-specific LLM evaluation systems.

@mandarvaze I'm not particularly surprised - I don't think Tesseract is very good at illustrations, or indeed anything that's not regular "typewritten" text

But for the boring stuff it works fantastically well

Mandar Vaze (desipenguin) Apr 1, 2024

I came across https://notes.joeldare.com/handwritten-text-recognition from Simon's post being discussed on HN

Handwritten Text Recognition

Jorge Maroto Mar 31, 2024

@simon good to know about this tool. I’ve been playing lately with tabula to extract data from long PDFs, and it works very nicely with structured data, but it doesn’t work as smoothly for plain text, so I’ll take a look at this