Tim Allison

@tallison

Files and search. Founder Rhapsode Consulting LLC. Chair/VP Apache Tika, committer Apache PDFBox, Apache POI, Apache Lucene/Solr, Apache Nutch, Apache OpenNLP. Philologist emeritus.

#ApacheTika #ApachePDFBox #ApachePOI #FileFormats #FileForensics #ApacheSolr #OpenSearch #ApacheNutch #ApacheStormCrawler #JavaSecurity
#foss #OpenSource #bassist #fedi22 #🏳️‍🌈🏳️‍⚧️Ally

github: https://github.com/tballison
linkedin: https://www.linkedin.com/in/tim-allison-5a6722/
Living the dream... 🤖
And even on these 5 files, that attempt was only right on 3...lol. @wtfpdf
And that didn't work directly, but Claude then found an answer...maybe?

So, I started my career in NLP in Perl, so I get these hacky heuristics. And having spent 4 years on a deep dive into PDFs with DARPA's SafeDocs program, I totally understand how, ahem, special PDFs can be.

Then I asked Gemini for help and gave that to Claude...

Which is, frankly, jaw-dropping. The solution, though, was then to hack out some bigram character frequency/language-dependent heuristics based on the evidence of ONE file.

Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text extraction tools, and a heuristic statistic to measure junk.
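
For a rough idea of what a "junk" statistic could look like, here is a minimal sketch. It is not the statistic used in this thread, which isn't specified; the function name `junk_score` and the choice of "suspect" character classes are assumptions for illustration only.

```python
import unicodedata

# Hypothetical choice: general categories that rarely appear in cleanly
# extracted text (control, private use, unassigned, surrogate).
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
REPLACEMENT_CHAR = "\ufffd"
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def junk_score(text: str) -> float:
    """Return the fraction of characters that look like extraction junk."""
    if not text:
        return 0.0
    junk = 0
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue  # ordinary whitespace is not junk
        if ch == REPLACEMENT_CHAR or unicodedata.category(ch) in SUSPECT_CATEGORIES:
            junk += 1
    return junk / len(text)
```

A lower score suggests cleaner output, so the same PDF could be run through several extraction tools and the scores compared. A statistic like this says nothing about character order, so on its own it would not catch reversed RTL text.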

It came back with this on one file.

OMG. Claude is on its game this morning!
#Claude why the cliffhangers?

Serialization is hard.

Claude is there to support me...

Y, I'll say that was a simplification.

*edited after I forgot git add 🤣 *