Tim Allison

@tallison

Files and search. Founder Rhapsode Consulting LLC. Chair/VP Apache Tika, committer Apache PDFBox, Apache POI, Apache Lucene/Solr, Apache Nutch, Apache OpenNLP. Philologist emeritus.

#ApacheTika #ApachePDFBox #ApachePOI #FileFormats #FileForensics #ApacheSolr #OpenSearch #ApacheNutch #ApacheStormCrawler #JavaSecurity
#foss #OpenSource #bassist #fedi22 #🏳️‍🌈🏳️‍⚧️Ally

github: https://github.com/tballison
linkedin: https://www.linkedin.com/in/tim-allison-5a6722/

Voting is underway for #ApacheTika 3.3.0! Please give it a try and let us know if there are any surprises!

https://lists.apache.org/thread/pq4zjvqf3w5zbm5yoyg14qvr2kpd2by3

Living the dream... 🤖

Anthropomorphizing the technology is just one more way humans try to escape accountability. “The AI contributed a patch”, “the AI wrote the blog post”, “the car hit the pedestrian” and “the knife killed the victim”: those are all the same framing.

https://swecyb.com/@anderseknert/116056950299738296

And even on these 5 files, that attempt was only right on 3...lol. @wtfpdf
Stay tuned! 🍿🍿🍿

Only 7 million, 999 thousand more PDFs to go...🤣🤣🤣

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

And that didn't work directly, but Claude then found an answer...maybe?

So, I started my career in NLP in Perl, so I get these hacky heuristics. And having spent 4 years on a deep dive into PDFs with DARPA's SafeDocs program, I totally understand how, ahem, special PDFs can be.

Then I asked Gemini for help and gave that to Claude...

Which is, frankly, jaw-dropping. The solution, though, was then to hack out some bigram character frequency/language-dependent heuristics based on the evidence of ONE file.

Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text extraction tools, and a heuristic statistic to measure junk.

It came back with this on one file.