Voting is underway for #ApacheTika 3.3.0! Please give it a try and let us know if there are any surprises!
https://lists.apache.org/thread/pq4zjvqf3w5zbm5yoyg14qvr2kpd2by3
Files and search. Founder Rhapsode Consulting LLC. Chair/VP Apache Tika, committer Apache PDFBox, Apache POI, Apache Lucene/Solr, Apache Nutch, Apache OpenNLP. Philologist emeritus.
#ApacheTika #ApachePDFBox #ApachePOI #FileFormats #FileForensics #ApacheSolr #OpenSearch #ApacheNutch #ApacheStormCrawler #JavaSecurity
#foss #OpenSource #bassist #fedi22 #🏳️‍🌈🏳️‍⚧️Ally
| github | https://github.com/tballison |
| linkedin | https://www.linkedin.com/in/tim-allison-5a6722/ |
Anthropomorphizing the technology is just one more way humans try to escape accountability. “The AI contributed a patch”, “the AI wrote the blog post”, “the car hit the pedestrian” and “the knife killed the victim”: those are all the same framing.
Only 7 million, 999 thousand more PDFs to go...🤣🤣🤣
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
So, I started my career in NLP writing Perl, so I get these hacky heuristics. And having spent 4 years on a deep dive into PDFs with DARPA's SafeDocs program, I totally understand how, ahem, special PDFs can be.
Then I asked Gemini for help and gave that to Claude...
Working on improving RTL text extraction from PDFs with Claude. I gave it 1k PDFs, a few text extraction tools, and a heuristic statistic to measure junk.
It came back with this on one file.