@beet_keeper :

What's your opinion on using the default unix "file" tool vs DROID to roughly-and-quickly identify mixed data sets?

🙋‍♀️ ❓

@p3ter I like it because it's a bit easier to use, or I compromise and use Siegfried because of Siegfried's structured output.

If you're not going to persist the identifiers, and don't need to conform to a digital preservation standard and use the PUID, then file is good, quick and easy.

RE: processing, you can maybe check out https://kellyjonbrazil.github.io/jc/ which allows you to pipe file output to JSON, e.g. file * | jc --file

Which helps make it a bit easier to process by machine.
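
For anyone without jc installed, here is a minimal sketch of the same idea in plain Python: turning `file`-style "name: description" lines into JSON records that a script can filter or count. The sample lines, the function name `parse_file_output` and the `filename`/`type` keys are assumptions for illustration only, not real output from any collection, and the split assumes filenames contain no ": ".

```python
import json
from collections import Counter


def parse_file_output(text):
    """Turn `file` output lines ("name: description") into a list of dicts."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first ": " only (assumes no ": " inside the filename).
        name, sep, description = line.partition(": ")
        if not sep:
            continue  # not a "name: description" line
        records.append({"filename": name.strip(), "type": description.strip()})
    return records


# Made-up sample in the style of `file *` output (illustrative only).
SAMPLE = """\
report.pdf: PDF document, version 1.4
clip.mov:   ISO Media, Apple QuickTime movie
notes.txt:  ASCII text
scan.pdf:   PDF document, version 1.7
"""

records = parse_file_output(SAMPLE)
print(json.dumps(records, indent=2))

# Rough tally by the first comma-separated part of each description.
counts = Counter(r["type"].split(",")[0] for r in records)
print(counts.most_common())
```

Once the output is structured like this, a quick tally per type is a one-liner, which is roughly what you want from a first pass over a mixed data set.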

@beet_keeper to clarify what you mean:

"you like *it*" means droid or file?

(Haven't used Siegfried yet. Will try. This one, right? https://github.com/richardlehane/siegfried)

I've had quite a few cases with AV files where file gets it, but DROID doesn't...

@p3ter yeah, file is good. You have to use all the tools. If DROID doesn’t get it Siegfried probably won’t either although it has a few more methods.
@p3ter do you use MediaInfo for AV identification too? (I've always thought it was likely to be the most reliable, but I guess it doesn't really give much on non-AV/heterogeneous collections)

@beet_keeper I use MediaInfo and `file` and exiftool all the time.

Therefore I've never had any reason to use DROID or Siegfried yet: That's why I was asking 😉

@p3ter @beet_keeper They are not mutually exclusive. The PRONOM registry will have signatures that file and other tools are not aware of yet.
@p3ter as an example, QuarkXPress 3.0 files are only identified by PRONOM (thanks @Thorsted!) and TrID, but neither Tika nor File knows about them!
@BertrandCaron @p3ter @Thorsted every format information source I've looked at has a significant fraction of unique entries https://www.digipres.org/workbench/registries/compare
@anj have you tried adding File to the mix? @BertrandCaron @p3ter
@Thorsted @BertrandCaron @p3ter yes, it's in the newer version of the index, but I've not updated that page to use it yet https://www.digipres.org/workbench/formats/format-index
@anj @BertrandCaron @p3ter Right, I thought I remembered seeing File in the mix before. This is amazing work Andy.
@Thorsted @BertrandCaron @p3ter thank you! Sorry it's not all up to date. Here's a more recent version of the uniqueness plot https://github.com/anjackson/unseen-formats/blob/main/src/data/2025-09-28-registries.unseen-uniqueness.svg
@anj @BertrandCaron @p3ter Definitely not a case of one tool to rule them all. Thanks!
@anj @Thorsted @BertrandCaron I remember seeing that graph in one of your slides Andy at NTTW, right?