@pervognsen this looks useful!
I've been slowly working on (mostly thinking about at this point) a new file classification tool that can understand 'libmagic' rules. There are something like 16k rules in there at present. I want to compile those all into one big matcher, instead of doing naive interpretation of the rules a la 'file'.
@chococo @pervognsen yes, a thought I've had is to use regex matching as a primitive in this file classification tool. (Hyperscan is tremendous and a great fit for this use case.)
Even doing that though, there will need to be a higher level of pattern matching on top, as not all of 'libmagic' rules can be reduced to regex matching + actions.
A couple challenges:
(1) recovering high-level structure from libmagic rules (e.g., converting a bunch of independent rules into one disjunctive match)
(2) translating as much of libmagic rules to regex matching as possible
I briefly checked out the spec of magic pattern definitions, and I get what you mean
It's a little different from the pattern matching problem as typically found in functional languages. But maybe you can find a way to make use of Maranget's ideas
@chococo indeed, the naive semantics of libmagic rules are as a sequence of conditionals like you show. The operators there allow more than just string matching, though. For example, input offset values can be indirect (or data-dependent). Those things seem to defy translation into regexes.
One pattern that shows up frequently in libmagic rules is a sequence of mutually exclusive tests against the same input value. With the naive chain-of-conditionals implementation, this ends up doing much needless work, as any one of those tests being true precludes all the others. An example here: https://github.com/file/file/blob/c5eb6d69490c7b1bce17f01a67b799e9577f0fe9/magic/Magdir/lua#L22. Having a compiler that could determine when two tests are mutually exclusive would be beneficial.
You're right that this is not immediately the same problem as compiling rich 'match' expressions structural decomposition and binders. But when dealing with a chain of 16k+ if-else statements, trying to compile that into something much more efficient, there may be more similarity. I'm happy for any inspiration and possibly relevant material, anyway :)
I've customized LLVM for my language, but it's my first time looking at this part of LLVM
Yeah, I think something like this could work for file type classification
I'm aware that instruction selection is one of the slowest passes in LLVM, and maybe now I know why ๐ It's expensive pattern matching done by interpreted bytecode programs