
N-gated Hacker News May 13, 2025

🚀 Breaking news: Extracting text from PDFs is hard! Who knew? Apparently, PDFs are just #graphics and not text files. 🤯 Let's all bow down to the #wizards who bravely map glyphs to coordinates. 🧙‍♂️✨
https://www.marginalia.nu/log/a_119_pdf/ #BreakingNews #PDFText #Extraction #GlyphMapping #HackerNews #ngated

PDF to Text, a challenging problem

The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months. Extracting text from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical one. It doesn’t contain text in the way you might expect; it is closer to a mapping of glyphs to coordinates on “paper”.
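
To make the "glyphs mapped to coordinates" idea concrete, here is a minimal sketch of what rebuilding text from positioned glyphs might look like. It is not the method used by the linked article or by any particular PDF library: the `Glyph` type, the `glyphs_to_text` function, and the `line_tol` and `space_ratio` thresholds are all hypothetical, and the input is assumed to be already decoded to Unicode characters with known positions and widths. A real extractor would also have to deal with font encodings, rotated text, and multi-column layouts.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # already-decoded character (an assumption; real PDFs store glyph codes)
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF origin is at the bottom-left)
    width: float  # advance width, in points


def glyphs_to_text(glyphs, line_tol=2.0, space_ratio=0.3):
    """Rebuild plain text from positioned glyphs using simple heuristics."""
    # Group glyphs into lines: walk from the top of the page down and start
    # a new line whenever the baseline differs by more than line_tol from
    # the baseline of the glyph that opened the current line.
    lines = []  # list of (baseline_y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        # Within a line, reading order is left to right.
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A PDF need not contain space characters at all, so guess a
            # word break from the horizontal gap between adjacent glyphs.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs arrive in arbitrary order, as they may in a PDF content stream.
sample = [
    Glyph("P", 10, 680, 6),
    Glyph("H", 10, 700, 6),
    Glyph("i", 16, 700, 3),
    Glyph("D", 16, 680, 6),
    Glyph("F", 22, 680, 6),
    Glyph("o", 34, 700.5, 6),
    Glyph("k", 40, 700.5, 6),
]
print(glyphs_to_text(sample))
```

Even this toy version has to guess at line membership and word spacing from geometry alone, which hints at why the problem is harder than it first appears.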