So my friend Chris taught me what byte pair encoding is and then I was idly thinking about it and I wonder if people can/do use this for generating bytecode interpreter superinstructions.

Anyway in the process of thinking about this I sketched the tiniest interpreter generator and superinstruction generator (but not finder)

https://gist.github.com/tekknolagi/8071878faa59514e3a2155bc8798638b

interp.c

GitHub Gist: instantly share code, notes, and snippets.

Gist
@tekknolagi Hmm, this sounds like a good situation for suffix arrays, augmented with longest-common-prefix arrays. There are algorithms to construct both in linear time, and then you can find the most frequently repeated substring of any given length in linear time as well. I think you can also quickly find the substring which maximizes the total length of that substring's repetitions, although I don't know whether that's also doable in linear time. And frankly I just think suffix arrays are neat and want to see somebody use them for something 😂
@jamey interesting, haven't heard of either of those!