Mastodawn

Tim Allison Oct 15, 2025

I recently added fully recursive extraction of embedded files to Apache Tika's command line.

This will also extract earlier versions of PDFs available through incremental updates.

This feature is still in beta. Let us know what you think.

Details in next toot.

#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika
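
For readers unfamiliar with incremental updates: a PDF can be revised by appending new objects and a new trailer after the existing end-of-file marker, so earlier revisions often survive as byte prefixes of the file. The sketch below is not how Apache Tika does it. It is a naive illustration that assumes every `%%EOF` marker closes one revision. Linearized PDFs and stray `%%EOF` bytes inside streams break that assumption. The function names are made up for this example.

```python
import re

# A %%EOF marker plus optional trailing spaces and one end-of-line sequence.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def revision_end_offsets(data: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker, in file order."""
    return [match.end() for match in EOF_MARKER.finditer(data)]


def candidate_revisions(data: bytes) -> list[bytes]:
    """Return one byte prefix per %%EOF marker, oldest first.

    Each prefix is only a *candidate* earlier revision; it still has to be
    opened with a real PDF parser to confirm that it is valid.
    """
    return [data[:end] for end in revision_end_offsets(data)]


if __name__ == "__main__":
    # Toy stand-in for a PDF that was saved once and then updated once.
    original = b"%PDF-1.7\n1 0 obj\n(v1)\nendobj\nstartxref\n9\n%%EOF\n"
    update = b"1 0 obj\n(v2)\nendobj\nstartxref\n60\n%%EOF\n"
    for index, revision in enumerate(candidate_revisions(original + update)):
        print(f"candidate revision {index}: {len(revision)} bytes")
```

On a real file you would read the bytes with `open(path, "rb").read()` and write each candidate out for inspection.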

Detectalix Aug 25, 2025

NEW YOUTUBE VIDEO on the forensic analysis of executables, image and document files, using different open source tools:
https://youtu.be/_ttnwLSt2P8

#DigitalForensics #fileforensics #fileanalysis #filemetadata

File analysis tools

Tim Allison Nov 5, 2024

@Ange "filtering files quickly and possibly reliably" 🤣🤣🤣

Thank you for sharing. This is a fantastic talk!

#fileformatology #fileFormatGeekery #fileforensics

Tim Allison Jan 17, 2024

Anyone in #fileforensics #forensics #digitalforensics willing to offer an informational interview?

I'm trying to figure out whether that field would be a good fit for me.

I have a substantial track record in open source communities and decent knowledge of file formats and some of the mayhem available. 😄

#fedihire

Tim Allison Nov 9, 2023

@decalage recently asked me if we had any files with Adobe LiveCycle's Usage Rights. I hadn't come across these before, but I think they'd have important implications for #fileforensics and #digipres

Adobe's link: https://help.adobe.com/en_US/livecycle/11.0/Services/WS92d06802c76abadb-6ec569c512dbeb3d9d6-7ffd.2.html

I opened https://issues.apache.org/jira/browse/TIKA-4168 to track discussion of this.

If you care about this topic or can offer technical advice on how to extract this info, please help!

cc @PDFassociation

Adobe LiveCycle ES4 * Applying usage rights to PDF documents

Tim Allison Jul 20, 2023

I've gotten a bunch of #infosec followers over the last coupla days.

For those interested in #fileforensics and especially PDFs, please take a look at our fairly newly released 8 million/8TB PDF corpus, derived from #CommonCrawl and then augmented by our team at #nasajpl

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

Tim Allison Jul 18, 2023

The info in the tables includes file types for embedded files, depth of embedded files, language id and a bunch of other features, including the #outOfVocabulary statistic.

These kinds of stats are really important for ingest for #search, #digipres and #fileforensics.

Take a look at the *-1k.csv files, and I can share the config file that extracted that info if you have an interest.
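
For anyone new to the out-of-vocabulary idea, here is a rough sketch of one common way to compute such a statistic: the share of alphabetic tokens that do not appear in a list of common words for the text's language. This is an illustration under my own assumptions and not necessarily the exact formula behind the published tables. The word list and function name are hypothetical.

```python
import re
from typing import Optional

# Hypothetical, tiny stand-in for a per-language list of common words.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "file", "this"}

# Runs of letters only (no digits, underscores or punctuation).
TOKEN = re.compile(r"[^\W\d_]+")


def out_of_vocabulary(text: str, common_words: set[str] = COMMON_WORDS) -> Optional[float]:
    """Return the fraction of alphabetic tokens missing from common_words.

    Returns None when the text has no alphabetic tokens, because the ratio
    is undefined in that case.
    """
    tokens = [token.lower() for token in TOKEN.findall(text)]
    if not tokens:
        return None
    missing = sum(1 for token in tokens if token not in common_words)
    return missing / len(tokens)


if __name__ == "__main__":
    print(out_of_vocabulary("This is the file"))
    print(out_of_vocabulary("xq zzk the"))
```

A high value on text extracted from a file is a hint that extraction produced junk (bad encoding, broken font mapping) or that language detection picked the wrong language.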

Tim Allison Jul 18, 2023

We recently published the results of running #ApacheTika on the corpus with an emphasis on PDF, erm, features.

There are two tables: a) one where each row is a URL for the primary/container PDF, and b) one where each row is either a URL for the primary/container PDF or an attachment within that PDF.

#digipres #fileforensics

https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/metadata/

Digital Corpora: corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/metadata/

Dan Conn Feb 18, 2023

Really looking forward to speaking at Brighton JUG, this Thursday 23rd February at 6pm GMT!

I'll be talking about Apache Tika and showing how some spicy information can be uncovered with it!

https://www.meetup.com/brighton-jug/events/290961686/

Or watch live here from 6:30pm GMT (you'll need to sort your own pizza and beer for this one though 😂)
https://www.youtube.com/watch?v=O8sjtnXgu98

#metadata #fileforensics #apachetika #mimetypes

February 2023 Brighton JUG Meetup, Thu, Feb 23, 2023, 6:00 PM | Meetup

We are very excited to welcome Dan Conn. He will be giving a talk on "Today's Special: Apache Tika Masala!" Event Format: Hybrid. Join us in-person: 6:
