i don't see enough people using one of the best tool improvements i've ever made for reverse engineering, so i had to write a blog post about it!

https://simonomi.dev/blog/color-code-your-bytes/

your hex editor should color-code bytes

@simonomi i'm curious about the statement here:

the bitstream is much more colorful and chaotic because good compression algorithms output data that looks visually random.

not disputing its correctness but this is a very nontrivial claim described in visual terms that are somewhat removed from the discussion just above regarding prefix codes. i'm curious about how you arrived at this and in particular if your reverse engineering work motivated this intuitive description

@simonomi i think prefix trees are supposed to be pretty standard nowadays but i'm particularly under the impression that older formats employed hand-rolled heuristics and i'm wondering if this is what you're speaking to with the discussion of visual randomness here
@simonomi very much not a reverse engineering expert but i have done some binary parsing and been frustrated with the expressiveness of languages for this task. scheme has some interesting work in this area but rust is where my experience is, and it could stand to do better

@hipsterelectron it mostly came from the intuition of having looked at so many different types of binary. stuff with really high information density (compressed, executable, media, etc) tends to look very busy, because there's simply more information squished into fewer bytes
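
one rough way to put a number on "busy" is shannon entropy over byte values. this is a minimal sketch, not something from the blog post: `byte_entropy` is a hypothetical helper, the word list is made up, and zlib stands in for "a good compression algorithm".

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (one value) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# made-up sample input: 20000 words drawn from a tiny vocabulary (seeded, so repeatable)
rng = random.Random(0)
words = ["fossil", "fighter", "hex", "byte", "color", "code", "tree", "prefix"]
plain = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

# the same information squeezed into fewer bytes
packed = zlib.compress(plain, 9)

print(f"plain:  {len(plain):6d} bytes, {byte_entropy(plain):.2f} bits/byte")
print(f"packed: {len(packed):6d} bytes, {byte_entropy(packed):.2f} bits/byte")
```

the compressed bytes should score much closer to 8 bits per byte than the plain text does, which is the same thing the color-coded view shows as visual noise.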

i've ended up writing my own whole fancy binary parsing system for my tool carbonizer. it's pretty specialized to the patterns used in the game files for Fossil Fighters, but i'm reasonably happy with it overall

@simonomi we developed a similar framework for the zip crate which was initially just to avoid multiple reads for data of known size but also supports safe and very performant SIMD searching for magic bytes (every single impl i've seen for zips works byte-by-byte and has trouble with the file comment). took some more work to use the same specification for writes
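
for context on the file comment problem: the zip end-of-central-directory record starts with the magic bytes `PK\x05\x06` and ends with a variable-length comment, and the comment itself can contain those same four bytes. the sketch below is not the zip crate's code and does no SIMD of its own. it hands the scan to `bytes.rfind` instead of a hand-written byte loop, and `find_eocd` is a hypothetical name. it also assumes the record sits exactly at the end of the file with no trailing junk.

```python
import struct

EOCD_MAGIC = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22   # fixed part of the record, before the comment
MAX_COMMENT = 0xFFFF   # comment length is stored as a 16-bit field


def find_eocd(data: bytes) -> int:
    """Return the offset of the end-of-central-directory record, or -1."""
    # the record cannot start earlier than 22 + 65535 bytes from the end
    lowest = max(0, len(data) - EOCD_FIXED_SIZE - MAX_COMMENT)
    end = len(data)
    while True:
        # search backwards for the whole 4-byte pattern
        pos = data.rfind(EOCD_MAGIC, lowest, end)
        if pos < 0:
            return -1
        if pos + EOCD_FIXED_SIZE <= len(data):
            # comment length lives in the last 2 bytes of the fixed part
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            # a real record's comment must run exactly to the end of the file
            if pos + EOCD_FIXED_SIZE + comment_len == len(data):
                return pos
        # false positive (e.g. magic bytes inside the comment): look further back
        end = pos + 3


# tiny demo: an empty archive whose comment contains a fake signature
comment = b"hello PK\x05\x06 fake"
record = struct.pack("<4sHHHHIIH", EOCD_MAGIC, 0, 0, 0, 0, 0, 0, len(comment)) + comment
archive = b"x" * 10 + record
print(find_eocd(archive))
```

the length check is what rejects the signature hidden in the comment. a stricter version would also validate the central directory offset and size fields.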
@simonomi i totally support bespoke parsing code and did similarly for zstd, which in particular has some data-dependent variable-length bit arithmetic that's especially hard to encode. i think it's useful to separate fetching blocks of input from parsing but the parsing itself seems easier to maintain without trying to genericize format specifics
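
a minimal sketch of that split between fetching and parsing, under heavy assumptions: the 12-byte header layout is invented for the example, and `fetch_block` and `parse_header` are hypothetical names, not anything from the zip crate, zstd, or carbonizer.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

# hypothetical header layout: 4-byte magic, u16 version, u16 flags, u32 payload length
HEADER = struct.Struct("<4sHHI")


class Header(NamedTuple):
    magic: bytes
    version: int
    flags: int
    payload_len: int


def fetch_block(stream: BinaryIO, size: int) -> bytes:
    """Fetching step: pull a block of known size in one read call, or raise EOFError."""
    block = stream.read(size)
    if len(block) != size:
        raise EOFError(f"wanted {size} bytes, got {len(block)}")
    return block


def parse_header(block: bytes) -> Header:
    """Parsing step: no I/O at all, just bytes in and fields out."""
    return Header(*HEADER.unpack(block))


# usage: fetch the fixed-size header once, parse it, then fetch the payload it describes
stream = io.BytesIO(HEADER.pack(b"DEMO", 1, 0, 5) + b"hello")
header = parse_header(fetch_block(stream, HEADER.size))
payload = fetch_block(stream, header.payload_len)
print(header, payload)
```

the format-specific part stays in `parse_header`, so it can be tested on plain byte strings without any stream, while `fetch_block` knows nothing about the format.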