Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text extraction tools and a heuristic statistic to measure junk.
It came back with this on one file.
Celebrating the majesty, the mystery, the comedy and the catastrophe of PDFs....mostly the latter two. Opinions not even mine.
#WtfPdf #pdf #PortableDocumentFormat #FileFormats #FileForensics #DigiPres #fedi22
Location: On a digital device near you and page 254
This just made me laugh.
Opened up this photo in quick look on my Mac, where it showed me the 'select text' icon. Weird, there's no text in it.
So I clicked it and searched, eventually noticing that 2 windows of a house in the distance have been gently highlighted. Weird. Copy the 'text' and paste it into my text editor, and...
Apparently that house is Simplified Chinese for 'Gold'
金金
"Some artists and albums will benchmark your utf8 support and annoy your operating system."
https://dustri.org/b/horrible-edge-cases-to-consider-when-dealing-with-music.html
At #ipres2025, @pwyatt explains that there is no one true PDF/A for any given content.
- there is an infinite number of PDF/A possibilities for any content
- anything can be made PDF/A, but that does not ensure it is a faithful representation
Vendors implement PDF/A conversion differently, so 5 different PDF/A conversions from different software will likely give you 5 different files, all of them valid.