This is a nice overview of Maranget's algorithm for compiling pattern matching: https://compiler.club/compiling-pattern-matching/

@pervognsen this looks useful!

I've been slowly working on (mostly thinking about at this point) a new file classification tool that can understand 'libmagic' rules. There are something like 16k rules in there at present. I want to compile those all into one big matcher, instead of doing naive interpretation of the rules a la 'file'.

@bradlarsen @pervognsen Maybe pattern matching with regular expressions would be sufficient for your case?

@bradlarsen @pervognsen

Saying only 'pattern matching' could be a little confusing, as it could actually mean different things

Maranget's algorithm works for the data types typically found in functional programming languages, which can represent complex data structures like trees

Regular expressions work for strings, like the contents of a file

I think both are well-understood problems

@chococo @pervognsen yes, a thought I've had is to use regex matching as a primitive in this file classification tool. (Hyperscan is tremendous and a great fit for this use case.)

Even doing that, though, there will need to be a higher level of pattern matching on top, as not all 'libmagic' rules can be reduced to regex matching + actions.

A couple challenges:

(1) recovering high-level structure from libmagic rules (e.g., converting a bunch of independent rules into one disjunctive match)

(2) translating as much of libmagic rules to regex matching as possible
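As a rough sketch of (2), a libmagic-style rule that tests a fixed string at a fixed offset can be turned into an anchored regex. The compile_string_rule helper is hypothetical, and this covers only the simplest kind of rule:

```python
import re

def compile_string_rule(offset: int, magic: bytes) -> re.Pattern:
    # A rule like "<offset> string <magic>" becomes: exactly `offset`
    # arbitrary bytes from the start of input, then the escaped literal.
    # re.DOTALL lets '.' match newline bytes too.
    return re.compile(rb"\A.{%d}" % offset + re.escape(magic), re.DOTALL)

# e.g. a PNG-signature-style rule, "0 string \x89PNG"
png_rule = compile_string_rule(0, b"\x89PNG")
```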

@bradlarsen

I briefly checked out the spec of magic pattern definitions, and I get what you mean

It's a little different from the pattern matching problem as typically found in functional languages. But maybe you can find a way to make use of Maranget's ideas

@bradlarsen

As for regular expressions, there are various ways to extend them, at varying performance costs. For example, you could have a special matching token for a constant offset in the string

With this approach, it would be inefficient to feed the whole file content through the underlying DFA, so you would need to think about ways to optimize
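For what it's worth, one way around scanning the whole file is to apply a compiled pattern only at the known position. Python's re module exposes this as the pos argument to Pattern.match; a minimal sketch (the ftyp example is invented, and another engine would need its own mechanism):

```python
import re

# Hypothetical MP4-style input whose brand marker sits at byte offset 4.
data = b"\x00\x00\x00\x18ftypisom"
ftyp = re.compile(b"ftyp")

# Start matching at offset 4 only; the engine never scans the rest of
# the buffer looking for the pattern elsewhere.
hit = ftyp.match(data, 4)
```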

@chococo indeed, the naive semantics of libmagic rules is as a chain of conditionals like you describe. The operators there allow more than just string matching, though. For example, input offset values can be indirect (i.e., data-dependent). Those things seem to defy translation into regexes.
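To make the indirect-offset point concrete, here's a hedged sketch with an invented field layout: the offset of the second test is read out of the input itself, which is what resists translation into a plain regex:

```python
import struct

def check_indirect(data: bytes) -> bool:
    # Read a little-endian u32 at byte offset 4; that value is itself
    # the offset at which the next test ("string ABCD") must succeed.
    # This layout is made up, but libmagic's indirect offsets are
    # data-dependent in this way.
    if len(data) < 8:
        return False
    (target,) = struct.unpack_from("<I", data, 4)
    return data[target:target + 4] == b"ABCD"
```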

One pattern that shows up frequently in libmagic rules is a sequence of mutually exclusive tests against the same input value. With the naive chain-of-conditionals implementation, this ends up doing much needless work, as any one of those tests being true precludes all the others. An example here: https://github.com/file/file/blob/c5eb6d69490c7b1bce17f01a67b799e9577f0fe9/magic/Magdir/lua#L22. Having a compiler that could determine when two tests are mutually exclusive would be beneficial.

You're right that this is not immediately the same problem as compiling rich 'match' expressions with structural decomposition and binders. But when you're dealing with a chain of 16k+ if-else statements and trying to compile it into something much more efficient, there may be more similarity than it first appears. I'm happy for any inspiration and possibly relevant material, anyway :)


@chococo My immediate motivation for fast file classification is to be able to collect that information when scanning in Nosey Parker, without drastically slowing things down. Using 'libmagic' directly seems to limit throughput to a few hundred MB/s.

I haven't seen any general-purpose file classifier that is designed with speed in mind.

https://github.com/praetorian-inc/noseyparker


@bradlarsen

One way to improve performance is to compile the rules to native executable code instead of interpreting them (think of the rules as a kind of program)

Sorry for making inadequate suggestions. I should have recognized that this isn't a problem I'm really familiar with

@chococo No problem! It's an unusual problem and I haven't run across any existing obvious solution. I'm happy for the conversation and pointers.

@bradlarsen

Actually, after a little more thought, I think introducing regular expressions into solving the high-level classification problem doesn't really make sense. Much of the power of regular expressions lies in expressing repeating string patterns (e.g. "a*", "a+"), but libmagic rules don't involve repetition

It's basically a decision tree that needs to be optimized somehow. You're right that Maranget's algorithm could be a good source of inspiration

@bradlarsen @chococo I wonder about looking at peepholes/pattern matching in compilers. There's a bit of a write-up of how patterns in LLVM are compiled, under "The pattern matching mechanism", here: https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1

@asb @bradlarsen

I've customized LLVM for my language, but it's my first time looking at this part of LLVM

Yeah, I think something like this could work for file type classification

I'm aware that instruction selection is one of the slowest passes in LLVM, and maybe now I know why 😂 It's expensive pattern matching done by interpreted bytecode programs