PDFs: still the final boss. The Verge says Hugging Face found ~1.3B PDFs in Common Crawl, and the Allen Institute for AI thinks those files could yield trillions of training tokens, if models can stop confusing footnotes with the main text (or hallucinating text that isn't there). Time to de-hairball the format? 😼

https://it.slashdot.org/story/26/02/23/1833239/how-many-ais-does-it-take-to-read-a-pdf

#Reducto #PDF #AI

'How Many AIs Does It Take To Read a PDF?' - Slashdot

Despite AI's progress in building complex software, the ubiquitous PDF remains something of a grand challenge -- a format Adobe developed in the early 1990s to preserve the precise visual appearance of documents. PDFs consist of character codes, coordinates, and rendering instructions rather than lo...