🚀 Running #OCR at scale with a #Vision #LLM for $0.49/hour

Just deployed dots.ocr (3B parameter Vision LLM by RedNote) on a single #RTX A6000 (48GB VRAM) via #RunPod. The results are great:

https://github.com/rednote-hilab/dots.ocr

#ai #opensource

📄 The Setup
- Upload any #PDF → server converts each page to an image (PyMuPDF)
- Images are sent in parallel to #vLLM (continuous batching)
- The Vision LLM reads each page and returns clean Markdown
- Results stream back as NDJSON: no timeouts, even on 100+ page docs

🧵 👇

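The fan-out pattern behind the setup above can be sketched in a few lines. This is a minimal stand-in, not the actual server code: `render_page` and `ocr_page` are illustrative stubs for what PyMuPDF rasterization and a POST to the vLLM endpoint would do.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stubs (names are not from the repo): render_page would
# rasterize one PDF page with PyMuPDF; ocr_page would send the image to
# the vLLM server and return that page's Markdown.
def render_page(page_no: int) -> bytes:
    return f"<image of page {page_no}>".encode()

def ocr_page(image: bytes) -> str:
    return image.decode().replace("image of", "markdown for")

def ocr_pdf(n_pages: int, max_workers: int = 8) -> list[str]:
    """Fan all pages out in parallel; vLLM's continuous batching keeps
    the GPU saturated regardless of request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        images = list(pool.map(render_page, range(1, n_pages + 1)))
        return list(pool.map(ocr_page, images))

pages = ocr_pdf(4)  # → Markdown for pages 1..4, in order
```

Because `map` preserves input order, page results come back in document order even though the requests run concurrently.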
⚡ Performance (32-page PDF, 2.9 MB)
- Single PDF: 22 seconds (all 32 pages OCR'd)
- 6-8 PDFs in parallel: GPU fully saturated
- ~7,500 pages/hour peak throughput
- ~230 PDFs/hour (32 pages each)
- Zero errors under full load

💰 Cost Comparison (per page)
- dots.ocr on RunPod: $0.00007
- Google Document AI: $0.0015 (OCR) / $0.03 (Form Parser)
- AWS Textract: $0.0015 (Detect Text) / $0.015 (Tables/Forms)
- Azure Doc Intelligence: $0.00125 (Read) / $0.01 (structured)

That's ~23x cheaper than cloud OCR for basic text extraction, and up to 140x cheaper than the structured extraction tiers. 📊

Processing 1,000 PDFs (32,000 pages): $2.32 vs ~$48 (cloud basic OCR) vs $320+ (cloud structured).
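The arithmetic checks out from the peak-throughput figures above (the $2.32 total reflects the slightly rounded $0.00007/page rate):

```python
# Back-of-envelope check of the per-page cost from the measured numbers.
gpu_cost_per_hour = 0.49   # RunPod RTX A6000 rate
pages_per_hour = 7_500     # peak throughput measured above

cost_per_page = gpu_cost_per_hour / pages_per_hour
print(f"${cost_per_page:.7f} per page")        # $0.0000653

pages = 1_000 * 32  # 1,000 PDFs at 32 pages each
print(f"${cost_per_page * pages:.2f} total")   # $2.09

print(f"{0.0015 / cost_per_page:.0f}x cheaper than $0.0015/page cloud OCR")  # 23x
```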

The entire stack is #opensource: the dots.ocr model from #HuggingFace, vLLM for inference, and a #FastAPI proxy with parallel rendering + streaming. Total model size is ~12GB, so it runs comfortably on any 24GB+ GPU.
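The NDJSON framing that makes the streaming work is simple: one JSON object per line, flushed as each page finishes. In the real stack this would sit behind a FastAPI `StreamingResponse`; this sketch shows just the framing, with illustrative field names.

```python
import json
from typing import Iterator

def ndjson_stream(results: Iterator[tuple[int, str]]) -> Iterator[str]:
    """Yield one JSON object per line as each page finishes, so a client
    can json.loads() each line without waiting for the whole document."""
    for page_no, markdown in results:
        yield json.dumps({"page": page_no, "markdown": markdown}) + "\n"

lines = list(ndjson_stream([(1, "# Page one"), (2, "# Page two")]))
```

Because every line is a complete JSON document, a slow 100+ page job never trips an HTTP timeout: bytes keep flowing as pages complete.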

Vision LLMs are making traditional OCR engines obsolete. No templates, no preprocessing rules, no layout config. Just send an image, get structured text back. 🎯