@SnoopJ and re. design, the important realization is that a "decoder library" as a separate function, separate binary, anything like that, is probably grossly bad for throughput! so yaxpeax-sm83 has this callback thing that i have no idea how to generalize: https://github.com/iximeow/yaxpeax-sm83/blob/no-gods-no-/src/lib.rs#L636-L724
so this specializes with the cpu impl: https://github.com/iximeow/yaxgbc/blob/no-gods-no-/src/cpu.rs#L34 so that there aren't separate decode/execute functions :D what was like 20 instructions to decode, 30 to do ABI glue + call/ret, and 20 to emulate all turned into like 15 instructions to decode and emulate most ops
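rough sketch of the shape of the idea, not the actual yaxpeax-sm83 or yaxgbc code: the decoder is generic over a handler trait and calls a handler method per decoded op, and the cpu implements that trait, so the compiler can monomorphize and inline decode and execute into one function. `OpHandler`, `decode_one`, `Cpu`, and `run_demo` are all made-up names for illustration, only three opcode groups are handled, and flags are ignored. whether you actually get the fused codegen depends on the optimizer, so check the disassembly.

```rust
/// Hypothetical handler trait: the decoder calls one method per decoded op.
trait OpHandler {
    fn nop(&mut self);
    fn ld_r_imm8(&mut self, reg: usize, imm: u8);
    fn add_a_r(&mut self, reg: usize);
}

/// Decode one instruction and hand it straight to `handler`.
/// Generic over `H`, so each handler type gets its own specialized copy
/// and the handler calls can be inlined into the decode match.
/// Returns the instruction length, or None for anything this sketch skips.
#[inline(always)]
fn decode_one<H: OpHandler>(bytes: &[u8], handler: &mut H) -> Option<usize> {
    let opcode = *bytes.first()?;
    match opcode {
        0x00 => {
            handler.nop();
            Some(1)
        }
        // LD r, d8 is encoded as 00 rrr 110; r == 6 means (HL), skipped here.
        op if op & 0xC7 == 0x06 && (op >> 3) & 7 != 6 => {
            let imm = *bytes.get(1)?;
            handler.ld_r_imm8(((op >> 3) & 7) as usize, imm);
            Some(2)
        }
        // ADD A, r is encoded as 10000 rrr; r == 6 means (HL), skipped here.
        op if op & 0xF8 == 0x80 && op & 7 != 6 => {
            handler.add_a_r((op & 7) as usize);
            Some(1)
        }
        _ => None,
    }
}

/// Toy CPU state. Register indices follow the opcode encoding:
/// B=0, C=1, D=2, E=3, H=4, L=5, (slot 6 unused), A=7.
struct Cpu {
    regs: [u8; 8],
    pc: usize,
}

impl OpHandler for Cpu {
    fn nop(&mut self) {}

    fn ld_r_imm8(&mut self, reg: usize, imm: u8) {
        self.regs[reg] = imm;
    }

    fn add_a_r(&mut self, reg: usize) {
        // Flag updates are left out of this sketch.
        self.regs[7] = self.regs[7].wrapping_add(self.regs[reg]);
    }
}

impl Cpu {
    /// One fused decode+execute step: no intermediate instruction struct.
    fn step(&mut self, mem: &[u8]) -> Option<()> {
        let len = decode_one(mem.get(self.pc..)?, self)?;
        self.pc += len;
        Some(())
    }
}

/// Runs LD A,5; LD B,3; ADD A,B; NOP and returns the final value of A.
fn run_demo() -> u8 {
    let program = [0x3E, 0x05, 0x06, 0x03, 0x80, 0x00];
    let mut cpu = Cpu { regs: [0; 8], pc: 0 };
    while cpu.step(&program).is_some() {}
    cpu.regs[7]
}
```

the tradeoff is the one mentioned above: the decoder stops being a standalone thing you can call across a library boundary, since it only gets fast once it's compiled together with whatever implements the handler.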