Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text-extraction tools, and a heuristic statistic to measure junk.

It came back with this on one file.

Which is, frankly, jaw-dropping. The solution, though, was to hack out some bigram character-frequency, language-dependent heuristics based on the evidence of ONE file.
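For flavor, here's a minimal sketch of what a bigram-frequency junk score could look like — the function names and the toy Hebrew strings are my own illustration, not the actual heuristic Claude produced:

```python
from collections import Counter

def bigrams(text):
    """Character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def junk_score(extracted, reference):
    """Fraction of character bigrams in `extracted` that never occur in
    `reference` text of the target language. Higher = junkier extraction.
    (Hypothetical scoring function, for illustration only.)"""
    ref = set(bigrams(reference))
    ext = bigrams(extracted)
    if not ext:
        return 1.0
    unseen = sum(1 for b in ext if b not in ref)
    return unseen / len(ext)

# Toy demo: RTL text extracted in visual (reversed) order produces
# bigrams the reference language model has never seen.
reference = "שלום עולם שלום לכולם"
good = "שלום עולם"
garbled = good[::-1]  # simulates a visual-order extraction bug
print(junk_score(good, reference) < junk_score(garbled, reference))  # prints: True
```

The catch, of course, is exactly the one above: the reference bigram table is language-dependent, so a table tuned on one file's language won't generalize.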

So, I started my career in NLP in Perl, so I get these hacky heuristics, and, having spent 4 years on a deep dive into PDFs with DARPA's SafeDocs program, I totally understand how, ahem, special PDFs can be.

Then I asked Gemini for help and gave that to Claude...

And that didn't work directly, but Claude then found an answer... maybe?

Only 7 million, 999 thousand more PDFs to go...🤣🤣🤣

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

Stay tuned! 🍿🍿🍿
And even on these 5 files, that attempt was only right on 3...lol. @wtfpdf