Mastodawn

What's your opinion on using the default Unix "file" tool vs DROID to roughly and quickly identify mixed data sets?

🙋‍♀️ ❓

#Digital ⚓️ #Vagabond 🦈May 20

@p3ter I like it because it's a bit easier to use, or I compromise and use Siegfried because of Siegfried's structured output.

If you're not going to persist the identifiers, and don't need to conform to a digital preservation standard and use the PUID, then file is good, quick and easy.

RE: processing, you can maybe check out https://kellyjonbrazil.github.io/jc/ which allows you to pipe file output to JSON, e.g. file * | jc --file

Which helps make it a bit easier to process by machine.
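As a rough illustration of what "easier to process by machine" means here, the following minimal Python sketch turns `file`-style output lines into JSON records using only the standard library. It is not jc's implementation: the `parse_file_output` helper, the sample lines, and the `filename`/`type` record keys are all assumptions made for the example.

```python
import json


def parse_file_output(text):
    """Split `file`-style lines ("name: description") into a list of dicts."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first ": "; a file name that itself contains ": "
        # would need smarter handling than this sketch provides.
        name, sep, description = line.partition(": ")
        if not sep:
            continue  # not a "name: description" line
        records.append({"filename": name.strip(), "type": description.strip()})
    return records


# Hypothetical sample output, not captured from a real run.
SAMPLE = """clip.mov:   ISO Media, Apple QuickTime movie
notes.txt:  ASCII text
"""

if __name__ == "__main__":
    print(json.dumps(parse_file_output(SAMPLE), indent=2))
```

Once the output is a list of records, filtering or grouping by type becomes an ordinary data-processing step rather than string wrangling.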

jc

CLI tool and python library that converts the output of popular command-line tools, file-types, and common strings to JSON, YAML, or Dictionaries. This allows piping of output to tools like jq and simplifying automation scripts.


@beet_keeper to clarify what you mean:

"you like *it*" means droid or file?

(Haven't used Siegfried yet. Will try. This one, right? https://github.com/richardlehane/siegfried)

I've had quite some cases with AV files where file gets it, but DROID doesn't...

GitHub - richardlehane/siegfried: signature-based file format identification


#Digital ⚓️ #Vagabond 🦈May 20

@p3ter yeah, file is good. You have to use all the tools. If DROID doesn’t get it Siegfried probably won’t either although it has a few more methods.

#Digital ⚓️ #Vagabond 🦈May 20

@p3ter do you use MediaInfo for AV identification too? (I've always thought it was likely to be the most reliable, but I guess it doesn't really give much on non-AV/heterogeneous collections)

@beet_keeper I use MediaInfo and `file` and exiftool all the time.

Therefore I've never had any reason to use DROID or Siegfried yet: That's why I was asking 😉

Thorsted May 20

@p3ter @beet_keeper They are not mutually exclusive. The PRONOM registry will have signatures that file and other tools are not aware of yet.

Bertrand Caron May 20

@p3ter as an example, QuarkXPress 3.0 files are only identified by PRONOM (thanks @Thorsted!) and TrID; neither Tika nor File knows about them!

Andy Jackson May 20

@BertrandCaron @p3ter @Thorsted every format information source I've looked at has a significant fraction of unique entries https://www.digipres.org/workbench/registries/compare

Comparing Format Registries | DigiPres Workbench

Thorsted May 20

@anj have you tried adding File to the mix? @BertrandCaron @p3ter

Andy Jackson May 20

@Thorsted @BertrandCaron @p3ter yes, it's in the newer version of the index, but I've not updated that page to use it yet https://www.digipres.org/workbench/formats/format-index

The Format Index (ALPHA) | DigiPres Workbench

Thorsted May 20

@anj @BertrandCaron @p3ter Right, I thought I remembered seeing File in the mix before. This is amazing work Andy.

Andy Jackson May 20

@Thorsted @BertrandCaron @p3ter thank you! Sorry it's not all up to date. Here's a more recent version of the uniqueness plot https://github.com/anjackson/unseen-formats/blob/main/src/data/2025-09-28-registries.unseen-uniqueness.svg

unseen-formats/src/data/2025-09-28-registries.unseen-uniqueness.svg at main · anjackson/unseen-formats

Applying 'unseen species' analysis for digital file formats and registries - anjackson/unseen-formats


Thorsted May 20

@anj @BertrandCaron @p3ter Definitely not a one tool to rule them all. Thanks!

@anj @Thorsted @BertrandCaron I remember seeing that graph in one of your slides Andy at NTTW, right?

#Digital ⚓️ #Vagabond 🦈May 20

@Thorsted @p3ter

> I've never had any reason to use DROID or Siegfried yet: That's why I was asking

Good questions. Adding to Tyler's point, I think it's important to pluralize tooling where there are gaps, and to be comfortable with multiple tools contributing to a single metadata record.

What's nice about the PRONOM model, and maybe other models in future (Wikibase), is the decentralization of knowledge about formats through its IDs/URIs.
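One way to picture several tools contributing to a single metadata record is the small Python sketch below. It is only an illustration: the `merge_identifications` helper, the record layout, and every value in `RESULTS` (including the `fmt/000` identifier) are placeholders invented for the example, not real tool output.

```python
def merge_identifications(path, results):
    """Combine per-tool results into one record, keeping each tool's answer."""
    record = {"path": path, "identifications": {}}
    for tool, fields in results.items():
        # Keep each tool's output under its own key so nothing is overwritten.
        record["identifications"][tool] = dict(fields)
    # Collect the distinct MIME types reported, which makes disagreements visible.
    record["mime_types"] = sorted(
        {fields["mime"] for fields in results.values() if fields.get("mime")}
    )
    return record


# Placeholder values for illustration only.
RESULTS = {
    "file": {"mime": "video/quicktime"},
    "siegfried": {"puid": "fmt/000", "mime": "video/quicktime"},
    "mediainfo": {"container": "MPEG-4", "video_codec": "AVC"},
}

if __name__ == "__main__":
    print(merge_identifications("clip.mov", RESULTS))
```

Keeping each tool's answer side by side, rather than picking a winner, preserves provenance and lets a registry identifier sit next to the finer-grained details from parsing tools.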

#Digital ⚓️ #Vagabond 🦈May 20

@Thorsted @p3ter

At least in theory, that seems promising for reducing how much we have to duplicate and maintain across records.

PRONOM-based tools may never be as precise as other methods, though, and you may simply be more comfortable in a context where you record more granular information from tools that parse the files.

The calculus is a little different in institutions like govt, where there are fewer technical users, so PRONOM-based systems work quite well at reducing the technical barrier.

#Digital ⚓️ #Vagabond 🦈May 20

@Thorsted @p3ter (oh wow, just saw you got a bunch of replies, lol! Hope some of it helped!)

@beet_keeper Indeed! Those were exactly the things I was looking for.

Thanks @anj and @Thorsted

@beet_keeper ...oh, and since recently, I'm using my "holodex" #xattr key/value data for such things:

If you look at the screenshot, you'll see:
exiftool + mediainfo + other extracted data in one place.

I can query over this.
Brilliant. ❤️ ⭐️

@beet_keeper But to be clear: The filetype identification parts still rely on the tool-internal capabilities/patterns, which are then stored as-is in xattrs.

Do DROID or Siegfried distinguish between AV files with same container, but different stream encodings?

Thorsted May 20

@p3ter @beet_keeper DROID and Siegfried use the PRONOM registry, which relies on simple pattern recognition, with no parsing of formats like MediaInfo or Exiftool do. So there is no stream data for AV. Identification should lead to using better parsing tools.
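To make the difference between pattern recognition and parsing concrete, here is a deliberately simplified Python sketch of byte-signature matching. It does not use PRONOM's actual signature syntax or data; the `SIGNATURES` table and the `identify` helper are assumptions for illustration, covering only three well-known magic-byte patterns.

```python
# Simplified magic-byte patterns: (offset, expected bytes, label).
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"%PDF-", "PDF document"),
    (4, b"ftyp", "ISO base media file (MP4/MOV family)"),
]


def identify(data):
    """Return the label of the first signature that matches, else None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


if __name__ == "__main__":
    # Hypothetical leading bytes of an MP4-style file.
    print(identify(b"\x00\x00\x00\x18ftypmp42"))
```

A match on `ftyp` only says which container family the file belongs to. Nothing in this approach looks inside the container, which is why stream encodings need a parsing tool.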

Bertrand Caron May 20

@p3ter @beet_keeper I like it as a working tool; the way it manages different levels of precision is a bit confusing but useful.

#BnF uses it as its main identification tool with -i (MIME type). Contribution is possible, though I had trouble understanding the syntax.