Porting SafeText and analyzing digital content with Apache Tika
by @beet_keeper
Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.
The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.
Steganographic fingerprints are one method that might be used to identify someone leaking information. Steganographic characters replace common textual characters with unusual but hard-to-detect variants: they either look the same to the human eye or are invisible altogether. Using a tool called SafeText by David Jacobson, we can identify these hidden fingerprints in the content that you share.
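To make the idea concrete, here is a minimal sketch in Go of what scanning text for such characters can look like. It is not SafeText's algorithm and not the port described in this post: the `Finding` type, the `FindSuspicious` function, and the small `suspicious` table are all hypothetical names and data chosen for illustration, and a real tool would need a far larger table.

```go
package main

import "fmt"

// Finding records one suspicious rune and where it was seen.
// (Hypothetical type for this sketch.)
type Finding struct {
	Offset int    // byte offset of the rune in the input string
	Rune   rune   // the suspicious code point
	Note   string // why it is flagged
}

// suspicious is a small illustrative table, NOT SafeText's actual list.
var suspicious = map[rune]string{
	'\u200B': "zero width space (invisible)",
	'\u200C': "zero width non-joiner (invisible)",
	'\u200D': "zero width joiner (invisible)",
	'\uFEFF': "zero width no-break space (invisible)",
	'\u00A0': "no-break space, looks like a regular space",
	'\u0430': "Cyrillic small a, looks like Latin a",
	'\u0435': "Cyrillic small ie, looks like Latin e",
	'\u043E': "Cyrillic small o, looks like Latin o",
}

// FindSuspicious walks the text rune by rune and reports every
// code point that appears in the table above.
func FindSuspicious(text string) []Finding {
	var findings []Finding
	for offset, r := range text {
		if note, ok := suspicious[r]; ok {
			findings = append(findings, Finding{Offset: offset, Rune: r, Note: note})
		}
	}
	return findings
}

func main() {
	// The sample hides a zero width space and swaps in a Cyrillic o.
	sample := "leak\u200Bed mem\u043E"
	for _, f := range FindSuspicious(sample) {
		fmt.Printf("byte %d: U+%04X %s\n", f.Offset, f.Rune, f.Note)
	}
}
```

A lookup table keeps the sketch easy to read, but note that flagging Cyrillic letters only makes sense for text you expect to be written in Latin script.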
I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.
A file can contain many different features, and each one is a challenge to its future interpretation, and thus to its preservation.
I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.
By coincidence, when I started writing this I had also just written about revisiting tikalinkextract, so I thought I would write a short explanation of how you might combine Tika and SafeText to perform some content analysis of your own.
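One way such a combination could look, sketched in Go: send a file to a running Tika server, take the plain text that comes back, and hand it to a character check. This assumes a Tika server listening on its default address `http://localhost:9998` and uses its `PUT /tika` endpoint with an `Accept: text/plain` header. The helper names `newTikaRequest` and `extractText` are hypothetical, and the non-ASCII count at the end is only a crude stand-in for a SafeText-style check, not how tikalinkextract or the SafeText port works.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// newTikaRequest builds a PUT request asking a Tika server to return
// the plain text of the document supplied in body.
func newTikaRequest(baseURL string, body io.Reader) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/tika"
	req, err := http.NewRequest(http.MethodPut, url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "text/plain")
	return req, nil
}

// extractText sends the document to Tika and returns the extracted text.
func extractText(baseURL string, body io.Reader) (string, error) {
	req, err := newTikaRequest(baseURL, body)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("tika returned %s", resp.Status)
	}
	text, err := io.ReadAll(resp.Body)
	return string(text), err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path of a file to analyse")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	// Assumption: a Tika server is running locally on the default port.
	text, err := extractText("http://localhost:9998", f)
	if err != nil {
		fmt.Println("extract:", err)
		return
	}

	// Crude stand-in for a SafeText-style check: count non-ASCII runes.
	nonASCII := 0
	for _, r := range text {
		if r > 127 {
			nonASCII++
		}
	}
	fmt.Printf("extracted %d bytes, %d non-ASCII runes\n", len(text), nonASCII)
}
```

Separating request construction from the network call keeps the Tika-specific details in one small function that can be checked without a server running.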
Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…
Let's have a look at putting Tika and SafeText together and see where it goes.
#ApacheTika #authenticity #Code #Coding #ContentAnalysis #Data #DigitalHumanities #digitalLiteracy #DigitalPreservation #Golang #integrity #Journalism #Metadata #Paradata #SafeText #steganography #Whistleblow #Whistleblower






