Porting SafeText and analyzing digital content with Apache Tika

by @beet_keeper

Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.

The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.

Steganographic fingerprints are one method that might be adopted to detect someone leaking information. They work by replacing common textual characters with unusual but hard-to-detect variants, i.e. characters that look the same to the human eye, or are actually invisible. Using a tool called SafeText by David Jacobson we can identify these hidden fingerprints in the content that you share.
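To make that idea concrete, here is a minimal sketch in Go of what looking for such characters could involve. It is not SafeText's code and makes no claim about how SafeText works; the names `Finding`, `lookalikes` and `scan` are my own for illustration, and the lookalike table is a tiny hand-picked sample rather than a complete list. It flags Unicode format characters (category Cf, which includes the zero-width space), spaces other than the ordinary ASCII space, and a few Cyrillic letters that resemble Latin ones.

```go
package main

import (
	"fmt"
	"unicode"
)

// Finding records one suspicious rune and where it was seen.
// (Hypothetical type, for illustration only.)
type Finding struct {
	Offset int    // byte offset into the input string
	Rune   rune   // the character that was flagged
	Reason string // why it was flagged
}

// lookalikes is a small, hand-picked sample of Cyrillic letters that
// resemble Latin ones. It is an assumption for this sketch, not a
// complete table of confusable characters.
var lookalikes = map[rune]rune{
	'\u0430': 'a', // CYRILLIC SMALL LETTER A
	'\u0435': 'e', // CYRILLIC SMALL LETTER IE
	'\u043e': 'o', // CYRILLIC SMALL LETTER O
	'\u0440': 'p', // CYRILLIC SMALL LETTER ER
	'\u0441': 'c', // CYRILLIC SMALL LETTER ES
}

// scan walks the text rune by rune and reports characters that are
// invisible, are unusual spaces, or appear in the lookalikes table.
func scan(text string) []Finding {
	var found []Finding
	for offset, r := range text {
		switch {
		case unicode.Is(unicode.Cf, r):
			// Format characters, e.g. U+200B ZERO WIDTH SPACE.
			found = append(found, Finding{offset, r, "invisible format character"})
		case r != ' ' && unicode.Is(unicode.Zs, r):
			// Space separators other than the ASCII space, e.g. U+00A0.
			found = append(found, Finding{offset, r, "unusual space"})
		default:
			if latin, ok := lookalikes[r]; ok {
				found = append(found, Finding{offset, r, fmt.Sprintf("looks like %q", latin)})
			}
		}
	}
	return found
}

func main() {
	// A zero-width space, a Cyrillic "e" and a no-break space are hidden here.
	sample := "Top\u200b secret me\u0435ting at\u00a0noon"
	for _, f := range scan(sample) {
		fmt.Printf("byte %d: %U (%s)\n", f.Offset, f.Rune, f.Reason)
	}
}
```

A hit is not proof of a fingerprint: no-break spaces and soft hyphens turn up in ordinary documents too, so each finding is a prompt to look closer rather than a verdict.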

I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.

A file can contain many different features, and each of them is a challenge to the file's future interpretation, and thus its preservation.

I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.

By coincidence, at the time I started writing this I had also just written about revisiting tikalinkextract, so I thought I would write this small explanation of how you might combine Tika and SafeText to perform some content analysis of your own.

Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…

Let's have a look at putting Tika and SafeText together and see where it goes.



#ApacheTika #authenticity #Code #Coding #ContentAnalysis #Data #DigitalHumanities #digitalLiteracy #DigitalPreservation #Golang #integrity #Journalism #Metadata #Paradata #SafeText #steganography #Whistleblow #Whistleblower
New article w Olle Sköld, Dydimus Zengenene, Lisa Andersson ”What a standard makes out of a process? Data-documentation standards and their consequences to process documentation” out in JDOC https://doi.org/10.1108/JD-10-2025-0324 #OpenAccess #paradata #CAPTURE_ERC
I am really excited to be a part of the Disentangling the Intertwinement of Digitalisation and Decolonisation conference at the Royal Danish Academy of Sciences and Letters, org’d by Eleanor Q. Neil and Rubina Raja (Aarhus University), with a talk on ”Digital Dataset as an Archive” #archaeology #data #archives #paradata https://urbnet.au.dk/news/events/2025/disentangling
⏰ Reminder: Tomorrow!
ENLIGHT & Arqus Alliance OS Webinar
Topic: Data makers’ and users’ views on useful paradata
🗓️ Mon, Sept 29 | 10:00–11:00 CET
💻 Online | 🎙️ Prof. dr. Isto Huvila (Uppsala University)
Don’t miss insights on what info about data creation, curation & use (paradata) makes data reusable!
🔗 Register: https://us05web.zoom.us/meeting/register/jKSxX6mJRvGpEaYmTqqKKg#/registration
ℹ️ More info: https://enlight-eu.org/landing-research-and-innovation/open-science/1040-enlight-rise-and-arqus-alliance-ambassador-webinar-series-on-open-science
#OpenScience #Paradata #ResearchData
📢 Upcoming ENLIGHT & Arqus Alliance OS Webinar!
Topic: Data makers’ and users’ views on useful paradata
🗓️ Monday, Sept 29, 10:00–11:00 CET
💻 Online
🎙️ Prof. dr. Isto Huvila (Uppsala University)
Paradata = the metadata about how data is created, curated, manipulated, and used — crucial for reusability.
🔗 Register: https://us05web.zoom.us/meeting/register/jKSxX6mJRvGpEaYmTqqKKg#/registration
ℹ️ More info: https://enlight-eu.org/landing-research-and-innovation/open-science/1040-enlight-rise-and-arqus-alliance-ambassador-webinar-series-on-open-science
📺 Previous webinars: https://www.youtube.com/playlist?list=PLnfetl7rb1WIhBuY-OuOU6B_G8yquro55
#OpenScience #Paradata #ResearchData
This week in Leiden at Lorentz Center working on #3D #paradata with @cpapadopoulos and a fantastic group of colleagues https://www.lorentzcenter.nl/paradata-in-3d-scholarship.html
Slides for today’s talk on Documenting the unruly AI: Capturing sociotechnical practices with paradata at #WORK2025 conf available at https://istohuvila.se/content/documenting-unruly-ai-capturing-sociotechnical-practices-paradata #CAPTURE_ERC #paradata #AI #AIResearch
Presentation at the WORK 2025 conference in Turku, Finland.

Out now! A book-length introduction to and comprehensive exploration of #paradata, ”Paradata: Documenting Data Creation, Curation and Use” from #CAPTURE_ERC, available #openaccess from Cambridge University Press https://www.cambridge.org/fi/universitypress/subjects/computer-science/computing-and-society/paradata-documenting-data-creation-curation-and-use with Zanna Friberg, Olle Sköld, Lisa Andersson & Ying-Hsang Liu
New article from #CAPTURE_ERC with Lisa Andersson and Olle Sköld: Researchers engage in #paradata generation both as an integrated part of their research work and as a discrete standalone activity, each with implications for the generated paradata https://doi.org/10.1002/asi.70003 #openaccess #ERC_research