Porting SafeText and analyzing digital content with Apache Tika

by @beet_keeper

Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.

The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.

Steganographic fingerprints are one method that might be adopted to detect someone leaking information. They replace common textual characters with unusual but hard-to-detect variants: characters that look the same to the human eye, or that are actually invisible. Using a tool called SafeText by David Jacobson we can identify these hidden fingerprints in the content that you share.

I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.

A file can contain many different features, and each of these is a challenge to the file's future interpretation, and thus its preservation.

I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.

By coincidence at the time I started writing this I had also just written about revisiting tikalinkextract and so I thought I would write this small explanation about how you might combine Tika and SafeText to perform some content analysis of your own.

Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…

Let's have a look at putting Tika and SafeText together and see where it goes.



#ApacheTika #authenticity #Code #Coding #ContentAnalysis #Data #DigitalHumanities #digitalLiteracy #DigitalPreservation #Golang #integrity #Journalism #Metadata #Paradata #SafeText #steganography #Whistleblow #Whistleblower

Voting is underway for #ApacheTika 4.0.0-alpha-1! 🎉

Started work on the 4.x branch in October 2024. Lots has changed, core principles remain.

Many, many thanks to the community of fellow devs and users!

Onwards towards 4.0.0!

https://lists.apache.org/thread/bjowzh4ssgtrghqjk7g2dtn9hs3qmyrv

Preview revamp of our website for #ApacheTika 4.x is live: https://tika.apache.org/docs/4.0.0-SNAPSHOT/

Let us know what you think and/or open PRs! Please!


Voting is underway for #ApacheTika 3.3.0! Please give it a try and let us know if there are any surprises!

https://lists.apache.org/thread/pq4zjvqf3w5zbm5yoyg14qvr2kpd2by3

On #ApacheTika we're moving entirely to JSON for configuration in 4.x.

If you use tika-server and are interested in runtime configuration, please take a look and offer feedback:

https://lists.apache.org/thread/jlt8jv47t8tm58dlrnxsrfodxm2d6o0z

Please repost for reach.

⚠️ CRITICAL XXE bug (CVE-2025-66516, CVSS 10.0) in Apache Tika (tika-core, tika-pdf-module, tika-parsers). Exploitation via crafted PDFs can lead to file disclosure & RCE. Upgrade to 3.2.2+ ASAP! https://radar.offseq.com/threat/critical-xxe-bug-cve-2025-66516-cvss-100-hits-apac-d08561e7 #OffSeq #ApacheTika #XXE #Security
🚨 CVE-2025-66516 CRITICAL: XXE in Apache Tika core (v1.13–3.2.1), tika-pdf-module, tika-parsers. Exploitable via crafted PDF XFA files — risks data exfil & DoS. Patch to 3.2.2+ now! https://radar.offseq.com/threat/cve-2025-66516-cwe-611-improper-restriction-of-xml-fa601313 #OffSeq #ApacheTika #XXE #Vuln

RE: https://mastodon.social/@tallison/115452030199746498

Please join me tomorrow, November 13 at noon EST to chat #ApacheTika.

Please dm me for the connection info.

LOL... given that I'm going to be a remote presenter, I taped my Digital Preservation Bake-off talk last night in case I have wifi problems during the session.

I really wish conferences would require 3 or 4 videos of the talk before I'm allowed to speak.

#ipres2025 #digipresBakeoff #ApacheTika

In belated celebration of World Digital Preservation Day, I'm throwing a "What's new with Apache Tika/Office hours" meetup at noon on November 13 EST.

This is intended for anyone interested in files, from search to digital preservation to file forensics and reverse engineering.

https://www.meetup.com/apache-tika-community/events/311746184

#wdpd2025 #ApacheTika

Apache Tika -- What's New/Office Hours, Thu, Nov 13, 2025, 12:00 PM | Meetup

This will be an expansion of my presentation at the Digital Preservation Bake Off (Tools Demonstration) #iPres2025 and a late entry to celebrate World Digital Preservation
