This is a nice overview of Maranget's algorithm for compiling pattern matching: https://compiler.club/compiling-pattern-matching/
Compiling Pattern Matching

I am a computer programmer. I'm primarily interested in compilers for strict functional languages.

@pervognsen this looks useful!

I've been slowly working on (mostly thinking about at this point) a new file classification tool that can understand 'libmagic' rules. There are something like 16k rules in there at present. I want to compile those all into one big matcher, instead of doing naive interpretation of the rules a la 'file'.

@bradlarsen @pervognsen Maybe a pattern matching with regular expressions would be sufficient for your case?

@chococo @pervognsen yes, a thought I've had is to use regex matching as a primitive in this file classification tool. (Hyperscan is tremendous and a great fit for this use case.)

Even doing that though, there will need to be a higher level of pattern matching on top, as not all of 'libmagic' rules can be reduced to regex matching + actions.

A couple challenges:

(1) recovering high-level structure from libmagic rules (e.g., converting a bunch of independent rules into one disjunctive match)

(2) translating as much of libmagic rules to regex matching as possible

@bradlarsen

I briefly checked out the spec of magic pattern definitions, and I get what you mean

It's a little different from the pattern matching problem as typically found in functional languages. But maybe you can find a way to make use of Maranget's ideas

@chococo indeed, the naive semantics of libmagic rules are as a sequence of conditionals like you show. The operators there allow more than just string matching, though. For example, input offset values can be indirect (or data-dependent). Those things seem to defy translation into regexes.

One pattern that shows up frequently in libmagic rules is a sequence of mutually exclusive tests against the same input value. With the naive chain-of-conditionals implementation, this ends up doing much needless work, as any one of those tests being true precludes all the others. An example here: https://github.com/file/file/blob/c5eb6d69490c7b1bce17f01a67b799e9577f0fe9/magic/Magdir/lua#L22. Having a compiler that could determine when two tests are mutually exclusive would be beneficial.

You're right that this is not immediately the same problem as compiling rich 'match' expressions structural decomposition and binders. But when dealing with a chain of 16k+ if-else statements, trying to compile that into something much more efficient, there may be more similarity. I'm happy for any inspiration and possibly relevant material, anyway :)

file/magic/Magdir/lua at c5eb6d69490c7b1bce17f01a67b799e9577f0fe9 ยท file/file

Read-only mirror of file CVS repository, updated every half hour. NOTE: do not make pull requests here, nor comment any commits, submit them usual way to bug tracker or to the mailing list. Maintai...

GitHub
@bradlarsen @chococo I wonder about looking at peepholes/pattern matching in compilers. There's a bit of a write-up of how patterns in LLVM are compiled under "The pattern matching mechanism" https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1 here.
A deeper look into the LLVM code generator, Part 1 - Eli Bendersky's website

@asb @bradlarsen

I've customized LLVM for my language, but it's my first time looking at this part of LLVM

Yeah, I think something like this could work for file type classification

I'm aware that instruction selection is one of the slowest passes in LLVM, and maybe now I know why ๐Ÿ˜‚ It's expensive pattern matching done by interpreted bytecode programs