Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text extraction tools, and a heuristic statistic to measure junk.
It came back with this on one file.
So, I started my career in NLP writing Perl, so I get these hacky heuristics. And having spent 4 years on a deep dive into PDFs with DARPA's SafeDocs program, I totally understand how, ahem, special PDFs can be.
Then I asked Gemini for help and gave that to Claude...
Only 7 million, 999 thousand more PDFs to go...🤣🤣🤣
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/