Mastodawn

Simon Willison Mar 30, 2024

I built a new tool: https://tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

OCR PDFs and images directly in your browser

Simon Willison Mar 30, 2024

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN)

The code is a little untidy but at 226 lines it honestly doesn't matter https://github.com/simonw/tools/blob/9fb049424f4ec8f8ffb91a59ab7111cad56088fc/ocr.html

Simon Willison Mar 30, 2024

Also neat is that the enabling libraries here - Tesseract.js and PDF.js - are both pretty old at this point:

First commit to Tesseract.js was Jun 26, 2015 https://github.com/naptha/tesseract.js/commit/906ce3cadbffaf5f7317a4418f282c4b78bf8385

First to PDF.js was Apr 25, 2011 https://github.com/mozilla/pdf.js/commit/6dc1770bba7a417ce5664c0305469e5bb7ea76bd

Simon Willison Mar 30, 2024

My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

https://github.com/simonw/textract-cli

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
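
The limits described above (JPEG or PNG only, at most 5MB) can be checked locally before any bytes go to AWS. What follows is a minimal Python sketch, not the actual textract-cli source: the helper names check_sync_eligible and ocr_image are hypothetical, and treating 5MB as 5 * 1024 * 1024 bytes is an assumption. The boto3 call used is detect_document_text, the synchronous Textract operation.

```python
from pathlib import Path

# Limits quoted in the post for Textract's synchronous API.
# Assumption: "5MB" is read here as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_sync_eligible(path: str) -> Path:
    """Hypothetical helper: raise ValueError if the file breaks the stated limits."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"{p.name}: only JPEG and PNG are supported")
    size = p.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"{p.name}: {size} bytes is over the 5MB limit, use S3 instead")
    return p


def ocr_image(path: str) -> str:
    """Hypothetical helper: send one image to Textract and return its text lines."""
    import boto3  # imported lazily so the local check works without boto3 installed

    p = check_sync_eligible(path)
    client = boto3.client("textract")  # assumes AWS credentials are already configured
    response = client.detect_document_text(Document={"Bytes": p.read_bytes()})
    # Textract returns PAGE, LINE and WORD blocks; keep only the LINE text.
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)
```

Running the check first means an oversized or unsupported file fails with a readable message instead of an API error.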

Simon Willison Mar 31, 2024

New feature for my browser-based OCR tool: you can now select the Tesseract.js language to use, from a list of 102 options

https://tools.simonwillison.net/ocr

Simon Willison

One tiny extra detail which possibly only I care about: changing the selection in the language select now updates a ?language=x query string, so you can bookmark a language and the back/forward buttons navigate through that selected state

Here's OCR for Welsh, bookmarked: https://tools.simonwillison.net/ocr?language=cym
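
The tool itself handles this in browser JavaScript, so the following Python sketch only illustrates the shape of those bookmarkable URLs, for example when generating links or test cases. The function names are hypothetical, and the "eng" fallback is an assumption: the thread does not say what the tool defaults to.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

BASE_URL = "https://tools.simonwillison.net/ocr"
DEFAULT_LANGUAGE = "eng"  # assumption: the tool's real default is not stated


def bookmark_url(language: str, base_url: str = BASE_URL) -> str:
    """Hypothetical helper: build a URL that preselects a Tesseract language code."""
    scheme, netloc, path, _query, fragment = urlsplit(base_url)
    query = urlencode({"language": language})
    return urlunsplit((scheme, netloc, path, query, fragment))


def language_from_url(url: str, default: str = DEFAULT_LANGUAGE) -> str:
    """Hypothetical helper: read the language code back out of a bookmarked URL."""
    values = parse_qs(urlsplit(url).query).get("language")
    return values[0] if values else default


print(bookmark_url("cym"))  # https://tools.simonwillison.net/ocr?language=cym
print(language_from_url("https://tools.simonwillison.net/ocr?language=frm"))  # frm
```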

Simon Willison Apr 1, 2024

Anyone got any documents lying around in Middle French, circa 1400-1600?

Apparently Tesseract / Tesseract.js can handle them, so I'd love to see my tool try!

https://tools.simonwillison.net/ocr?language=frm

Simon Willison Apr 1, 2024

... definitely going to stop tinkering with this thing now, but I did add a few basic automated tests just now using Playwright Python https://github.com/simonw/tools/blob/main/tests/test_ocr.py - and a tiny bit of assistance from Claude 3 Opus https://github.com/simonw/tools/issues/8#issuecomment-2029152772

Joel Kin Apr 1, 2024

@simon that’s the claim this band makes, though I can’t say I have the wherewithal to verify it! https://genius.com/albums/Vehemence-fr/Ordalies

Joel Kin Apr 1, 2024

@simon I do have a pdf of a French play from around 1630 if you’d like

@frueheneuzeit Apr 1, 2024

@simon you should get in touch with this guy: https://huggingface.co/Pclanglais (he goes by Alexander Doria on Twitter and Bluesky).

Pclanglais (Pierre-Carl Langlais)

Tintin & Mickey!

Johan Richer Apr 1, 2024

@simon https://gallica.bnf.fr/ark:/12148/btv1b86000209/f14.item

Gargantua. La Vie inestimable du grand Gargantua, père de Pantagruel , jadis composée par l'abstracteur de quinte essence. Livre plein de pantagruélisme -- 1535 -- livre

Gallica

: j@fabrica:~/src; Apr 1, 2024

@simon Now I’m wondering about Linear A

Mia Apr 1, 2024

@simon https://manuscrits-france-angleterre.org/polonsky/en/content/accueil-en (but only the BnF ones display at the moment because of the BL situation)

France-England: medieval manuscripts between 700 and 1200

rd_palmer Apr 1, 2024

@simon perhaps something of use in Alix Chagué and Thibault Clérice's project to provide a catalogue of training datasets: https://htr-united.github.io/catalog.html

HTR-United

HTR-United is a catalog and an ecosystem for sharing and finding ground truth for optical character or handwritten text recognition (OCR/HTR).

alexwlchan Apr 1, 2024

@simon 1470, French, seems like a good bet: https://wellcomecollection.org/works/eu3ym7su/items?canvas=8

Livre des simples médecines, in French

Wellcome Collection