I recently added fully recursive extraction of embedded files to Apache Tika's command-line app.

This will also extract earlier versions of PDFs available through incremental updates.

This feature is still in beta. Let us know what you think.

Details in next toot.

#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika

NEW YOUTUBE VIDEO on the forensic analysis of executables, image and document files, using different open source tools:
https://youtu.be/_ttnwLSt2P8

#DigitalForensics #fileforensics #fileanalysis #filemetadata

File analysis tools

@Ange "filtering files quickly and possibly reliably" 🤣🤣🤣

Thank you for sharing. This is a fantastic talk!

#fileformatology #fileFormatGeekery #fileforensics

Anyone in #fileforensics #forensics #digitalforensics willing to offer an informational interview?

I'm trying to figure out whether the field would be a good fit for me.

I have a substantial track record in open source communities and decent knowledge of file formats and some of the mayhem available. 😄

#fedihire

@decalage recently asked me if we had any files with Adobe LiveCycle's Usage Rights. I hadn't come across these before, but I think they'd have important implications for #fileforensics and #digipres

Adobe's link: https://help.adobe.com/en_US/livecycle/11.0/Services/WS92d06802c76abadb-6ec569c512dbeb3d9d6-7ffd.2.html

I opened https://issues.apache.org/jira/browse/TIKA-4168 to track discussion of this.

If you care about this topic or can offer technical advice on how to extract this info, please help!

cc @PDFassociation

Adobe LiveCycle ES4 * Applying usage rights to PDF documents

I've gotten a bunch of #infosec followers over the last coupla days.

For those interested in #fileforensics and especially PDFs, please take a look at our fairly newly released 8 million/8TB PDF corpus, derived from #CommonCrawl and then augmented by our team at #nasajpl

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

The info in the tables includes file types for embedded files, depth of embedded files, language id and a bunch of other features, including the #outOfVocabulary statistic.

These kinds of stats are really important for ingest for #search, #digipres and #fileforensics.

Take a look at the *-1k.csv files, and I can share the config file that extracted that info if you have an interest.
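For anyone who wants a quick first look at one of those CSV samples, here is a minimal sketch of tallying the values in a single column. The column names (`url`, `embedded_mime`, `embedded_depth`) and the sample rows are made up for illustration; check the header row of the real files for the actual names.

```python
import csv
import io
from collections import Counter


def tally_column(csv_file, column):
    """Count how often each value appears in one column of a CSV table."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Missing or empty cells are grouped under "(empty)".
        value = (row.get(column) or "").strip() or "(empty)"
        counts[value] += 1
    return counts


# Tiny made-up sample standing in for one of the *-1k.csv files.
# The column names and rows below are hypothetical, not the real schema.
SAMPLE = """url,embedded_mime,embedded_depth
https://example.com/a.pdf,image/jpeg,1
https://example.com/a.pdf,application/pdf,1
https://example.com/b.pdf,image/jpeg,2
https://example.com/c.pdf,,0
"""

if __name__ == "__main__":
    counts = tally_column(io.StringIO(SAMPLE), "embedded_mime")
    for value, n in counts.most_common():
        print(f"{n:>4}  {value}")
```

To run it against a downloaded file, swap the `io.StringIO(SAMPLE)` argument for a file opened with `open(path, newline="", encoding="utf-8")` and pass whichever column you care about.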

We recently published the results of running #ApacheTika on the corpus with an emphasis on PDF, erm, features.

There are two tables: a) each row is a URL (for the primary/container PDF) and b) each row is a URL for the primary/container PDF OR an attachment within that PDF.

#digipres #fileforensics

https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/metadata/

Really looking forward to speaking at Brighton JUG, this Thursday 23rd February at 6pm GMT!

I'll be talking about Apache Tika and showing how some spicy information can be uncovered with it!

Sign up here to join in person with beer and pizza

https://www.meetup.com/brighton-jug/events/290961686/

Or watch live here from 6:30pm GMT (you'll need to sort your own pizza and beer for this one though 😂)
https://www.youtube.com/watch?v=O8sjtnXgu98

#metadata #fileforensics #apachetika #mimetypes

February 2023 Brighton JUG Meetup, Thu, Feb 23, 2023, 6:00 PM | Meetup

We are very excited to welcome **Dan Conn**. He will be giving a talk on **Today's Special: Apache Tika Masala!**
**Event Format: Hybrid** * Join us in-person: 6:
