I recently added fully recursive extraction of embedded files to Apache Tika's command line.

It also extracts earlier versions of a PDF that are still available through incremental updates.

This feature is still in beta. Let us know what you think.

Details in next toot.

#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika

@Ange "filtering files quickly and possibly reliably" 🤣🤣🤣

Thank you for sharing. This is a fantastic talk!

#fileformatology #fileFormatGeekery #fileforensics

3) I reran Tika, 'file' and #siegfried on all the files.

You can explore the detected mime types via datasette: https://corpora.tika.apache.org/datasette

Or, download the whole sqlite db: https://corpora.tika.apache.org/base/share/tika-mimes-20230714.db.gz

I mean, who wouldn't want to spend the weekend looking for differences btwn #siegfried and #file and #ApacheTika?!
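
If you go the download route, a first step might look like the sketch below. It only unpacks the gzipped db and asks sqlite itself which tables and columns exist, since I'm not assuming any table or column names here. `gunzip` and `list_tables` are made-up helper names for illustration, not part of Tika or any other tool.

```python
import gzip
import shutil
import sqlite3
from pathlib import Path


def gunzip(src, dest):
    """Decompress a .gz file (e.g. the downloaded .db.gz) to dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dest)


def list_tables(db_path):
    """Return {table_name: [column names]} as reported by sqlite itself."""
    conn = sqlite3.connect(str(db_path))
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            cols = conn.execute('PRAGMA table_info("{}")'.format(name)).fetchall()
            schema[name] = [col[1] for col in cols]
        return schema
    finally:
        conn.close()
```

Usage would be something like `list_tables(gunzip("tika-mimes-20230714.db.gz", "tika-mimes-20230714.db"))`. Once you can see the real column names, you can write the actual comparison queries against them.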

#filefun #digipres #fileformat #fileformatology

[TIKA-4059] Consider parsing common gzipped formats like we do with package files - ASF JIRA