@pervognsen this looks useful!
I've been slowly working on (mostly thinking about at this point) a new file classification tool that can understand 'libmagic' rules. There are something like 16k rules in there at present. I want to compile those all into one big matcher, instead of doing naive interpretation of the rules a la 'file'.
Saying only 'pattern matching' could be a little confusing, as it could actually mean different things
Maranget's algorithm works for the algebraic data types typically found in functional programming languages, which can represent complex structures like trees
Regular expressions work for strings, such as the content of a file
I think both are well-understood problems
@chococo @pervognsen yes, a thought I've had is to use regex matching as a primitive in this file classification tool. (Hyperscan is tremendous and a great fit for this use case.)
Even doing that though, there will need to be a higher level of pattern matching on top, as not all of 'libmagic' rules can be reduced to regex matching + actions.
A couple challenges:
(1) recovering high-level structure from libmagic rules (e.g., converting a bunch of independent rules into one disjunctive match)
(2) translating as much of libmagic rules to regex matching as possible
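To make challenge (1) concrete, here's a rough sketch of what "one disjunctive match" could look like for the easy case of independent literal-prefix rules. The rule names and magic bytes are just illustrative examples, and Python's `re` module stands in for what would really be handed to Hyperscan:

```python
import re

# A few independent libmagic-style rules, each a literal byte pattern at
# offset 0, merged into a single disjunctive regex with one named group
# per rule. (Illustrative only; a real implementation would feed these
# patterns to Hyperscan rather than Python's re module.)
rules = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "GZIP": b"\x1f\x8b",
    "ELF": b"\x7fELF",
}

combined = re.compile(
    b"|".join(b"(?P<%s>%s)" % (name.encode(), re.escape(magic))
              for name, magic in rules.items())
)

def classify(data: bytes):
    m = combined.match(data)
    return m.lastgroup if m else None
```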
I briefly checked out the spec of magic pattern definitions, and I get what you mean
It's a little different from the pattern matching problem as typically found in functional languages. But maybe you can find a way to make use of Maranget's ideas
As for regular expressions, there are various ways to extend them, at varying performance costs. For example you could have a special matching token for a constant offset into the string
With this approach, it feels inefficient to feed the whole file content into the underlying DFA, and you would need to think of ways to optimize
@chococo indeed, the naive semantics of libmagic rules are as a sequence of conditionals like you show. The operators there allow more than just string matching, though. For example, input offset values can be indirect (or data-dependent). Those things seem to defy translation into regexes.
One pattern that shows up frequently in libmagic rules is a sequence of mutually exclusive tests against the same input value. With the naive chain-of-conditionals implementation, this ends up doing much needless work, as any one of those tests being true precludes all the others. An example here: https://github.com/file/file/blob/c5eb6d69490c7b1bce17f01a67b799e9577f0fe9/magic/Magdir/lua#L22. Having a compiler that could determine when two tests are mutually exclusive would be beneficial.
You're right that this is not immediately the same problem as compiling rich 'match' expressions with structural decomposition and binders. But when you're trying to compile a chain of 16k+ if-else statements into something much more efficient, there may be more similarity. I'm happy for any inspiration and possibly relevant material, anyway :)
@chococo My immediate motivation for fast file classification is to be able to collect that information when scanning in Nosey Parker, without drastically slowing things down. Using 'libmagic' directly seems to limit throughput to a few hundred MB/s.
I haven't seen any general-purpose file classifier that is designed with speed in mind.
One way to improve would be to compile the rules to native executable code instead of interpreting them (think of the rules as a kind of program)
Sorry that my earlier suggestions were inadequate; I should have recognized that this isn't a problem I'm really familiar with
Actually, after a little more thought, I think introducing regular expressions into the high-level classification problem doesn't really make sense. Much of the power of regular expressions lies in expressing repeating string patterns (e.g. "a*", "a+"), but the libmagic rules mostly aren't about repetition
It's basically a decision tree that needs to be optimized somehow. You're right that Maranget's algorithm could be a good source of inspiration
I've customized LLVM for my language, but it's my first time looking at this part of LLVM
Yeah, I think something like this could work for file type classification
I'm aware that instruction selection is one of the slowest passes in LLVM, and maybe now I know why: it's expensive pattern matching done by interpreted bytecode programs