Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text-extraction tools, and a heuristic statistic to measure junk.

It came back with this on one file.

Which is, frankly, jaw-dropping. The solution, though, was to hack out some bigram character-frequency, language-dependent heuristics based on the evidence of ONE file.
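For flavor, here's a minimal sketch of what a bigram-frequency junk score could look like — the function names and the toy Hebrew strings are my own illustration, not the actual heuristic Claude produced:

```python
from collections import Counter

def bigrams(text):
    """Character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def junk_score(extracted, reference):
    """Fraction of character bigrams in `extracted` that never occur in
    `reference` text of the target language. Higher = junkier extraction.
    (Hypothetical scoring function, for illustration only.)"""
    ref = set(bigrams(reference))
    ext = bigrams(extracted)
    if not ext:
        return 1.0
    unseen = sum(1 for b in ext if b not in ref)
    return unseen / len(ext)

# Toy demo: RTL text extracted in visual (reversed) order produces
# bigrams the reference language model has never seen.
reference = "שלום עולם שלום לכולם"
good = "שלום עולם"
garbled = good[::-1]  # simulates a visual-order extraction bug
print(junk_score(good, reference) < junk_score(garbled, reference))  # prints: True
```

The catch, of course, is exactly the one above: the reference bigram table is language-dependent, so a table tuned on one file's language won't generalize.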

So, I started my career in NLP in Perl, so I get these hacky heuristics, and, having spent 4 years on a deep dive into PDFs with DARPA's SafeDocs program, I totally understand how, ahem, special PDFs can be.

Then I asked Gemini for help and gave that to Claude...

And that didn't work directly, but Claude then found an answer... maybe?

Only 7 million, 999 thousand more PDFs to go...🤣🤣🤣

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

Stay tuned! 🍿🍿🍿
And even on these 5 files, that attempt was only right on 3...lol. @wtfpdf