PDF -> Markdown at 100 pages per second.

I hate that it's for #AI but I also think it will have implications for #digipres one way or another.

#wtfpdf

https://social.lansky.name/@hn50/115255089991079589

Hacker News 50 (@hn50@social.lansky.name)

OpenDataLoader-PDF: An open source tool for structured PDF parsing Link: https://github.com/opendataloader-project/opendataloader-pdf Discussion: https://news.ycombinator.com/item?id=45347147

I’m adding a couple photos on file structure. #icabarcelona2025 #wtfpdf #digipres #odfa

Has anyone worked with ODF/A? Have it in your corpus? I found it interesting that one of the reasons for developing ODF/A was issues around Chinese characters in PDF (example: difficulties in copying from PDF). #digipres @wtfpdf #icabarcelona2025 #wtfpdf

#archivtagAT #archivtag2025 Andreas Rauber shows an example of a PDF that also contains HTML or can be stored as a virtual machine, which then has extended functionality. What happens when such a PDF is normalized or migrated? #wtfPDF

pdfalyze --help
(https://github.com/michelcrypt4d4mus/pdfalyzer)

outputs:

"Explore PDF's inner data structure with absurdly large and in depth visualizations. Track the control flow of her darker impulses, scan rivers of her binary data for signs of evil sorcery, and generally peer deep into the dark heart of the Portable Document Format. Just make sure you also forgive her - she knows not what she does."

#wtfpdf

GitHub - michelcrypt4d4mus/pdfalyzer: Analyze PDFs. With colors. And Yara.


On the effects of the useful `mutool clean` command to "repair" PDFs.

If you take this PDF https://openreview.net/pdf?id=CSJYz1Zovj and apply the command to it... the 27-page annex is cut off from the "cleaned" output (still in the PDF, but unreferenced, so not displayed).

So use it with care!

#wtfpdf

Well, I've finally made a blog post, but on the #OPF website!

https://openpreservation.org/blogs/validation-ok-they-said-fixing-the-rendering-of-a-so-called-valid-pdf

I'm walking you through the more complex of the two PDF repair processes I've carried out. Any input is welcome!

#digipres #wtfpdf

Today's #wtfPDF moment: #PDFs with images that are encoded as #JPEG, where the JPEG data stream in turn is ascii85 encoded.

WHY?!? (The ascii85 encoding only inflates the JPEG data streams by 25% and doesn't offer any benefits).

https://github.com/KBNLresearch/pdfquad/issues/39

Image XObject with /ASCII85Decode /DCTDecode filter · Issue #39 · KBNLresearch/pdfquad

I have some PDFs with images encoded like this: 5 0 obj << /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /DCTDecode ] /Height 3676 /Length 2303713 /Subtype /Image /Type /XObje...
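
The 25% figure follows from how ASCII85 works: every 4 input bytes become 5 output characters. A minimal sketch of that arithmetic, assuming Python's standard-library `base64.a85encode` as a stand-in for the PDF filter and a made-up JPEG-like byte string (not data from the PDFs in the issue):

```python
import base64

# Hypothetical stand-in for a JPEG stream: SOI marker, filler bytes, EOI marker.
# The filler avoids all-zero 4-byte groups, which ASCII85 would abbreviate to "z".
jpeg_like = b"\xff\xd8" + bytes(range(1, 253)) * 4 + b"\xff\xd9"


def ascii85_ratio(data: bytes) -> float:
    """Return encoded size divided by raw size for ASCII85-encoded data."""
    return len(base64.a85encode(data)) / len(data)


# Encoding layer, as a writer applying /ASCII85Decode on top of /DCTDecode would add it.
encoded = base64.a85encode(jpeg_like)

# Decoding layer: a reader strips ASCII85 first and is left with the JPEG bytes.
decoded = base64.a85decode(encoded)

print(len(jpeg_like), len(encoded), ascii85_ratio(jpeg_like))
print(decoded[:2] == b"\xff\xd8")  # the JPEG start-of-image marker survives the round trip
```

The round trip returns the original bytes unchanged, so the extra layer only adds size; the JPEG compression itself is untouched.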

This is in French, but the link is to a 404 Media story. Every now and then it’s #wtfpdf FTW. #digitalforensics #digitaldiplomatics
From: @BertrandCaron
https://digipres.club/@BertrandCaron/113923145282409719
Bertrand Caron (@BertrandCaron@digipres.club)

Content warning: US politics - Metadata
