Character bigrams and naive Bayes can get you pretty darned far.
Oh, and a couple of agents and a boatload of data.
And, I guess all of the researchers whose shoulders I'm standing on...
Files and search. Founder Rhapsode Consulting LLC. Chair/VP Apache Tika, committer Apache PDFBox, Apache POI, Apache Lucene/Solr, Apache Nutch, Apache OpenNLP. Philologist emeritus.
#ApacheTika #ApachePDFBox #ApachePOI #FileFormats #FileForensics #ApacheSolr #OpenSearch #ApacheNutch #ApacheStormCrawler #JavaSecurity
#foss #OpenSource #bassist #fedi22 #🏳️‍🌈🏳️‍⚧️Ally
| github | https://github.com/tballison |
| linkedin | https://www.linkedin.com/in/tim-allison-5a6722/ |
Voting is underway for #ApacheTika 4.0.0-alpha-1! 🎉
Started work on the 4.x branch in October 2024. Lots has changed, core principles remain.
Many, many thanks to the community of fellow devs and users!
Onwards towards 4.0.0!
https://lists.apache.org/thread/bjowzh4ssgtrghqjk7g2dtn9hs3qmyrv
Preview revamp of our website for #ApacheTika 4.x is live: https://tika.apache.org/docs/4.0.0-SNAPSHOT/
Let us know what you think and/or open PRs! Please!
Voting is underway for #ApacheTika 3.3.0! Please give it a try and let us know if there are any surprises!
https://lists.apache.org/thread/pq4zjvqf3w5zbm5yoyg14qvr2kpd2by3
Anthropomorphizing the technology is just one more way humans try to escape accountability. “The AI contributed a patch”, “the AI wrote the blog post”, “the car hit the pedestrian” and “the knife killed the victim”, those are all the same framing.
Only 7 million, 999 thousand more PDFs to go...🤣🤣🤣
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/