Simon Willison Mar 30, 2024

I built a new tool: https://tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

OCR PDFs and images directly in your browser

Mandar Vaze (desipenguin) Mar 31, 2024

@simon Simon (and/or anyone in this thread):

Is there a good tool or library for extracting text from handwritten notes (photographed and saved as images)?
Tesseract doesn't work well.
I tried Google Lens with better results, but that means I need to upload the image to their server.

#ocr for #handwritten text

Mandar Vaze (desipenguin)

@simon FWIW, I ran OCR on the image from https://hamel.dev/blog/posts/evals/ - the post you shared earlier.
I'm pretty sure it was created with Excalidraw (or a similar tool) and the text is rendered in a font.
But the OCR was 50% correct at best.

Your AI Product Needs Evals

How to construct domain-specific LLM evaluation systems.

Simon Willison Apr 1, 2024

@mandarvaze I'm not particularly surprised - I don't think Tesseract is very good at illustrations, or indeed anything that's not regular "typewritten" text

But for the boring stuff it works fantastically well