I built a new tool: https://tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

OCR PDFs and images directly in your browser

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN)

The code is a little untidy but at 226 lines it honestly doesn't matter https://github.com/simonw/tools/blob/9fb049424f4ec8f8ffb91a59ab7111cad56088fc/ocr.html


Also neat is that the enabling libraries here - Tesseract.js and PDF.js - are both pretty old at this point:

First commit to Tesseract.js was Jun 26, 2015 https://github.com/naptha/tesseract.js/commit/906ce3cadbffaf5f7317a4418f282c4b78bf8385

First commit to PDF.js was Apr 25, 2011 https://github.com/mozilla/pdf.js/commit/6dc1770bba7a417ce5664c0305469e5bb7ea76bd


My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

https://github.com/simonw/textract-cli

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
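
A rough Python sketch of the pre-flight check those limits imply, assuming boto3 is installed and AWS credentials are configured. The helper names (`check_sync_eligible`, `detect_text`) are made up for illustration and are not taken from textract-cli; the only AWS call used is boto3's `detect_document_text`. Treating "5MB" as 5 * 1024 * 1024 bytes is also an assumption.

```python
from pathlib import Path

# Limits described above for Textract's synchronous API.
# Assumption: "5MB" is read here as 5 * 1024 * 1024 bytes.
MAX_SYNC_BYTES = 5 * 1024 * 1024
SYNC_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_sync_eligible(path: str) -> None:
    """Raise ValueError if the file cannot go through the synchronous API."""
    p = Path(path)
    if p.suffix.lower() not in SYNC_SUFFIXES:
        raise ValueError(f"{p.name}: only JPEG and PNG are supported")
    size = p.stat().st_size
    if size > MAX_SYNC_BYTES:
        raise ValueError(f"{p.name}: {size} bytes is over the 5MB limit")


def detect_text(path: str) -> dict:
    """Send the image bytes to Textract and return the raw response."""
    import boto3  # imported lazily so the check above works without it

    check_sync_eligible(path)
    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": Path(path).read_bytes()})
```

Anything that fails the check would need the S3-based asynchronous route instead, which this sketch does not cover.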


New feature for my browser-based OCR tool: you can now select the Tesseract.js language to use, from a list of 102 options

https://tools.simonwillison.net/ocr


One tiny extra detail which possibly only I care about: changing the selection in the language select now updates a ?language=x query string, so you can bookmark a language and the back/forward buttons navigate through that selected state

Here's OCR for Welsh, bookmarked: https://tools.simonwillison.net/ocr?language=cym
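
The tool itself does this in browser JavaScript. As an illustration of just the URL round trip, here is a small Python sketch using the standard library. The function names are hypothetical, and falling back to `eng` when no language is given is my assumption, not something stated in this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

DEFAULT_LANGUAGE = "eng"  # assumed default; not stated in the thread


def url_for_language(base_url: str, code: str) -> str:
    """Return base_url with its query string replaced by ?language=<code>."""
    parts = urlsplit(base_url)
    return urlunsplit(parts._replace(query=urlencode({"language": code})))


def language_from_url(url: str) -> str:
    """Read the language code back out of a bookmarked URL."""
    values = parse_qs(urlsplit(url).query).get("language")
    return values[0] if values else DEFAULT_LANGUAGE
```

Because the selected language lives in the URL rather than in page state, a bookmark or a back/forward navigation carries everything needed to restore the selection.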


Anyone got any documents lying around in Middle French, circa 1400-1600?

Apparently Tesseract / Tesseract.js can handle them, so I'd love to see my tool try!

https://tools.simonwillison.net/ocr?language=frm


... definitely going to stop tinkering with this thing now, but I did add a few basic automated tests just now using Playwright Python https://github.com/simonw/tools/blob/main/tests/test_ocr.py - and a tiny bit of assistance from Claude 3 Opus https://github.com/simonw/tools/issues/8#issuecomment-2029152772
@simon that’s the claim this band makes, though I can’t say I have the wherewithal to verify it! https://genius.com/albums/Vehemence-fr/Ordalies
@simon I do have a pdf of a French play from around 1630 if you’d like
@simon you should get in touch with this guy: https://huggingface.co/Pclanglais Alexander Doria on Twitter and bluesky.

Gargantua. La Vie inestimable du grand Gargantua, père de Pantagruel, jadis composée par l'abstracteur de quinte essence. Livre plein de pantagruélisme (1535, book, on Gallica)
@simon Now I’m wondering about Linear A
@simon https://manuscrits-france-angleterre.org/polonsky/en/content/accueil-en (but only the BnF ones display at the moment because of the BL situation)
France-England: medieval manuscripts between 700 and 1200

@simon perhaps something of use in Alix Chagué and Thibault Clérice's project to provide a catalogue of training datasets: https://htr-united.github.io/catalog.html
HTR-United is a catalog and an ecosystem for sharing and finding ground truth for optical character or handwritten text recognition (OCR/HTR).

Livre des simples médecines, in French (Wellcome Collection)

@simon The (oddly hard to find) Textractor python library does this nicely, with async interface too:

> pip[x] install amazon-textract-textractor
> textractor detect-document-text your_file.png output.json

https://aws-samples.github.io/amazon-textract-textractor/commandline.html

But maybe it's processing the output into something useful that you needed? Parsing their JSON can be tricky, but that library also has a Document class with handy `to_markdown` or `to_pandas` methods
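
For the plain-text case, a minimal standard-library sketch of pulling lines out of that JSON. It assumes the file holds a raw DetectDocumentText-style response, that is, a top-level `Blocks` list whose `LINE` entries carry a `Text` field. Whether the textractor CLI's `output.json` has exactly that shape is an assumption, and both function names are hypothetical.

```python
import json


def lines_from_response(response: dict) -> list[str]:
    """Return the text of every LINE block, in the order Textract listed them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def lines_from_file(path: str) -> list[str]:
    """Load a saved JSON response from disk and extract its lines."""
    with open(path, encoding="utf-8") as f:
        return lines_from_response(json.load(f))
```

This ignores `WORD` and `PAGE` blocks, as well as anything related to tables or forms, which is where the parsing gets harder.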


@symroe well that would have saved me a bit of time! Thanks for the link, I'll add that to the textract-cli README
@simon I also built about 70% of a DIY solution before finding it! 🤷‍♂️
@simon Tesseract (the non-JS version) was originally created by HP in the 1980s and open-sourced in 2005.
@simon nice! i was thinking of trying to do something similar to autogenerate alt text, which i currently tend to do by opening images in chrome and using google lens (far too many clicks)
@molly0xfff Yes! I first used something like this for the alt text in my annotated presentation tool here: https://til.simonwillison.net/tools/annotated-presentations

@molly0xfff @simon in case you didn't know: both Mastodon's web version and Ivory offer OCR for uploaded images
@simon Very cool. Though I get a Heroku error when I try to go to your site ("Application error: An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details. You can do this from the Heroku CLI with the command heroku logs --tail")
@aaronjschaffer Huh... it looks like it's the Mastodon effect, where sending out a link causes thousands of Mastodon servers to all hit /.well-known/webfinger?resource=acct:[email protected] at the same time - but I've survived these storms just fine in the past, not sure why it's hurting the site today
@simon Ah gotcha! I love a little suspense, I'll check again later!
@aaronjschaffer Worked through it here, should be working OK again now https://github.com/simonw/simonwillisonblog/issues/415
Get Cloudflare to cache /.well-known/webfinger · Issue #415 · simonw/simonwillisonblog

@simon under 8MB! 3MB each for Tesseract WASM and the training data.
@simon need more such browser-only, offline-first, privacy-first apps that don't require any installation or configuration!

@prem_k @simon If you didn't see it at the time, this was quite a cool offline browser-based transcription tool posted a few weeks back:

https://bne.social/@simon/112057608292224084

Like you, I love these kinds of tools but if I could *beg* the authors for one feature - please make it easy to download the needed files so I can run it all truly offline :)

Simon Elvery (@[email protected])

I wrote about scratching my own itch and building a transcription tool. It's completely private, neither the audio or the transcript ever leaves your browser. If this is the kind of tool you use, I'd love to hear your feedback (both on the write-up and the tool). https://elvery.net/drzax/cobbling-together-a-private-machine-transcription-and-editing-tool/

bne.social
@StuartGray @prem_k That really is a worthwhile feature for this one, I've opened an issue - no promises I'll solve it though, there are things in there relating to bundling that I don't know how to do yet https://github.com/simonw/tools/issues/2
Version of OCR that can run entirely offline · Issue #2 · simonw/tools

@simon Insanely cool! It works fine in Android Chrome (no luck with Firefox though).
Johannes Baiter (@[email protected])

Attached: 1 video 👀 #demotime Ever wish you could search through a #IIIF manifest, but the provider had neither #OCR, nor a #ContentSearch endpoint available? 🪄 You can soon help yourself: Fully client-side OCR and Content Search + Autocomplete for Mirador 3. And it survives page reloads! ✨

OpenBiblio.Social
@simon I love reading about your process. It's been so fun to create small applets using AI with a bit of human assist. Do you think the Tesseract OCR has improved over the past few years? I remember it being quite sloppy back in the day.
@Jage it definitely has - they moved to a fancy LSTM neural network based thing within the last 5 years I think
@simon Very cool. Thanks for sharing.
@simon This is very cool. I’ve been wondering for a while if there’s a similar tool for handwritten text. One of my family members left behind journals that are very difficult to read, though I can make out letters here and there. It seems like it should be possible to make some sort of trainable, handwriting reader, but I don’t know if anyone’s done that.
@greatblueheron AWS Textract works really well with handwriting in my experience
@simon I really like this! Really nice to read the JS, HTML, CSS all in one file too 👏 awesome! I find projects putting models in browsers super interesting right now!
@simon interested to hear more about your analysis on "very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently—I’ll write more about that soon. But those tools are still inconvenient for most people to use."
Multi-modal support for vision models such as GPT-4 vision · Issue #331 · simonw/llm

https://platform.openai.com/docs/guides/vision I think this is best handled by command line options --image and --image-urls to either encode and pass as base64, or to pass a URL.

@simon Hi, are you also looking at upgrading the OCR engine, perhaps using VikParuchuri/surya? https://github.com/VikParuchuri/surya
Vik Paruchuri (@VikParuchuri) on X

Announcing surya layout! It detects tables, images, figures, section headers, and more. It works with any language, and a variety of document types. Find it here - https://t.co/DD2HfwI8jK . Thanks @LambdaAPI for sponsoring compute.

X (formerly Twitter)
@kaveinthran for this particular project I'm only looking at libraries I can run in a web browser, but thanks for the tip - I hadn't seen that one before
@simon how does it do with data in boxes? Most of what I have to do OCR on are forms, and most of the data is in boxes. OCR that reads whole lines, or even columns, has problems with this.
@mcrocker pretty badly - Tesseract isn't the best tool for that. I would expect AWS Textract or maybe Claude 3 Opus or Gemini Pro 1.5 to handle those better, though they still aren't completely error free in my experience

@simon Simon (and/or anyone in this thread) :

Is there a good tool/library for extracting text from handwritten notes (converted to images via photos)?
Tesseract doesn't work well.
I tried Google Lens with better results, but that means I need to upload the image to their server.

#ocr for #handwritten text

@simon FWIW, I ran OCR on the image from https://hamel.dev/blog/posts/evals/ - the post you shared earlier.
I'm pretty sure it was created with Excalidraw (or a similar tool) and the text is a font.
But the OCR was 50% correct at best.
Your AI Product Needs Evals: How to construct domain-specific LLM evaluation systems.

@mandarvaze I'm not particularly surprised - I don't think Tesseract is very good at illustrations, or indeed anything that's not regular "typewritten" text

But for the boring stuff it works fantastically well

I came across https://notes.joeldare.com/handwritten-text-recognition from Simon's post being discussed on HN

@simon good to know about this tool. I’ve been playing lately with tabula to extract data from long PDFs, and it works very nicely with structured data, but it doesn’t work smoothly for plain text, so I’ll take a look at this