
This task was my first real experience with OCR technology,
and the findings took me completely by surprise!

I used Docling to process scanned documents in French, Italian
and Telugu using three different OCR engines.

Here is what happened 👇

Why does this matter?

Ramalama uses Docling to convert documents into text before
feeding them into AI models. If OCR fails, the AI gets garbage.

So getting multilingual OCR right is really important for
building good RAG pipelines in a global community like Fedora.

What I tested:

→ Tesseract — classical OCR, 100+ languages
→ EasyOCR — AI-based, modern approach
→ ocrmac — Apple's built-in Vision framework

Documents: a French+English textbook, an Italian reader,
an old Telugu manuscript and a modern Telugu novel.
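
For anyone who wants to try a setup like the one above, here is a minimal sketch of how the OCR engine could be switched in Docling. It is not the code from the experiment. The helper names (`lang_codes`, `build_converter`), the language-code table and the file name in the usage comment are assumptions for illustration, and it assumes the scans are PDFs. ocrmac is left out because it only runs on macOS.

```python
# Engine-specific language codes (assumed values; check each engine's
# own language list before relying on them).
LANG_CODES = {
    "tesseract": {"french": "fra", "english": "eng", "italian": "ita", "telugu": "tel"},
    "easyocr": {"french": "fr", "english": "en", "italian": "it", "telugu": "te"},
}


def lang_codes(engine, languages):
    """Translate plain language names into the codes one engine expects."""
    table = LANG_CODES[engine]
    return [table[name] for name in languages]


def build_converter(engine, languages):
    """Return a Docling converter that runs OCR with the chosen engine."""
    # Imported lazily so lang_codes() works without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    option_classes = {
        "tesseract": TesseractCliOcrOptions,  # needs the tesseract binary on PATH
        "easyocr": EasyOcrOptions,
    }
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = option_classes[engine](
        lang=lang_codes(engine, languages)
    )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (hypothetical file name):
# converter = build_converter("easyocr", ["telugu"])
# text = converter.convert("telugu_novel.pdf").document.export_to_markdown()
```

Keeping the language table separate from the converter makes it easy to run the same document through each engine and compare the text that comes out.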

What I learned:

Font style matters MORE than language support!

Old Telugu manuscript → all engines struggled
Modern Telugu novel → all engines worked well

And surprisingly, EasyOCR beat Tesseract on modern Telugu!
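
The thread does not say how the engines were scored. One common way to put a number on "beat" is character error rate (CER): the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below assumes you have such a reference text. The function names are illustrative, and it counts Unicode code points, so a Telugu vowel sign counts as its own character.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edits needed per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A lower CER means the OCR text is closer to the reference, so two engines can be compared by running `cer` on the same page with the same reference.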

Exploring multilingual OCR with Docling, Tesseract, EasyOCR and ocrmac - ChinniSree/Docling-processing-multilingual-documents (GitHub)