I have been enjoying the recent threads by @pervognsen about instruction-level efficiency on computers over the years, so I'm going to do a brief one of my own based on some of my recent adventures.

It'll end up touching on topics that are coming up today and on some conversations I'd had some years ago.

In 2018, I wrote a cellular automaton program for the Sega Genesis (8MHz 68000 CPU), and in 2023 I wrote a version for the SNES (3-ish MHz 65816 CPU).

#retrocomputing #genesis #snes

SNES was 20% faster. Why? At the time we'd batted around some theories, with two very solid reasons why it might win.

The first is that the SNES CPU is tuned to hit memory every processor cycle, so a lower speed can still be faster. The Genesis *wasn't.* At memory speeds it was "really" only 2MHz.

The M68K had a proper 16-bit data bus, though, so word-level work would be faster. Alas, the simulation code was all bytework, so the SNES, with its 2.7MHz speed to RAM and 3.6MHz to ROM, came out faster.

There is also the fact that I'd *started* by baking in the algorithmic optimizations I'd already done on the Sega and iterated further, and beyond that (since I'm Pretty Darn OK at 65xx code, if not exactly one of the greats) I'd managed to implement it all with no register spills.

There were some fun moments there; it turned out that even when you hit memory every cycle, it was occasionally STILL faster to recompute values rather than cache and restore them.

Here in 2025, I've decided to return to my original m68k code and see if I could bring back some of what I'd learned since. It was, after all, my first "real" m68k program, and I then went on to pick up a few years of hobby experience on the Amiga, Classic Mac, and Atari ST.

In returning to my old code, I find a third possible explanation: I was *real* bad at this back then.

After tightening up my code and porting back in the portable improvements that I'd made SNES-side, now the m68k wins.

At a high level, the m68k's advantage turns out to be that while the memory is slower (2MHz, as opposed to the 3.6MHz FastROM speed on the SNES), each memory cycle brings in *a word* of instructions, because *the word* and not the byte is the unit of instructions. That gives us rough parity on transfer speed, and the larger instruction words mean the code *density* in "instruction bytes per second" can take an edge. Shorter instruction sequences do more, even as the individual instructions are maybe larger.
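To put rough numbers on that "parity" claim, here's a back-of-the-envelope sketch in Python. It uses only the figures quoted in this thread (2MHz word-wide accesses on the Genesis, 3.6MHz byte-wide FastROM accesses on the SNES); the function name is made up for illustration, and it ignores refresh, DMA, and any other real-hardware wrinkles.

```python
def fetch_bytes_per_second(accesses_per_second, bytes_per_access):
    """Peak instruction-stream bandwidth, ignoring refresh, DMA, and wait states."""
    return accesses_per_second * bytes_per_access


# Figures as quoted in this thread, not measured values.
genesis = fetch_bytes_per_second(2_000_000, 2)  # 16-bit bus: one word per access
snes = fetch_bytes_per_second(3_600_000, 1)     # 8-bit bus: one byte per access

print(f"Genesis: {genesis / 1e6:.1f} MB/s, SNES FastROM: {snes / 1e6:.1f} MB/s")
print(f"Genesis/SNES ratio: {genesis / snes:.2f}")
```

Under those assumptions the Genesis pulls in a bit more instruction data per second despite the slower memory clock, so the rest comes down to how much work each fetched byte does.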

The big win for me at the instruction level was realizing that post-incrementing pointers is free: reading or writing "(a0)+" is the same cost as "(a0)" since the add happens under the hood. Incrementing with its own instruction costs *8* cycles, and that's nearly a whole frame wasted in my most expensive loop.

Better yet, while switching to post-increment broke subsequent indexes, it removed one index entirely, saving 4 bytes, 4 cycles, and an additional half a frame.
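For a sense of how a handful of cycles turns into fractions of a frame, here's a small Python sketch of the arithmetic. The assumptions are loud ones: the nominal 8MHz clock from earlier in the thread (real NTSC hardware runs slightly slower), a 60 frames-per-second display, and the central loop being the 16,000-iteration one. The names are hypothetical.

```python
CPU_HZ = 8_000_000      # nominal clock from this thread; real NTSC hardware is a bit slower
FRAMES_PER_SECOND = 60  # assumption: NTSC display
ITERATIONS = 16_000     # assumption: trip count of the central loop

CYCLES_PER_FRAME = CPU_HZ / FRAMES_PER_SECOND


def frames_spent(cycles_per_iteration, iterations=ITERATIONS):
    """Frames consumed by something that costs this many cycles on every iteration."""
    return cycles_per_iteration * iterations / CYCLES_PER_FRAME


print(f"8-cycle increment: {frames_spent(8):.2f} frames")
print(f"4-cycle index:     {frames_spent(4):.2f} frames")
```

With these numbers the standalone increment works out to just under one frame and the extra index to about half of one, which lines up with the estimates above.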

That wasn't enough to catch up to the SNES, though, which had a full 20% advantage over the Sega... even though one clever little trick I did there (which didn't port over to the m68k), letting me remove a loop variable, turned out to save enough time to produce the whole disparity on its own. No, the major gain was in porting over the changes I made to the NON-central loop.

The one with 500 iterations instead of 16,000.

The one I could be sloppy about because it doesn't matter.

That code was VERY MUCH written to just be Definitely Correct instead of fast. It had quite a few procedure calls in it. In my revisit I calculated that the call/return instructions ALONE were enough to account for an entire frame of lag, and some other individual expensive instructions added up to sizable fractions of one. As part of avoiding register spilling on the SNES, all of that code had ended up entirely gone, thanks to, essentially, extremely aggressive induction variable optimizations.

That code ported very neatly over to the Sega; I'd already managed to jam the code into 3 registers and now I had 15. A few decisions went the other way: recomputing some values is cheaper than spilling to memory, but "spilling" to OTHER REGISTERS is cheaper still. Once it was in place? A 20-25% speedup over the original, matching and sometimes exceeding the SNES, but nevertheless broadly at par.
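If "induction variable optimization" is an unfamiliar term, here's a generic Python sketch of the idea. To be clear, this is NOT the code from either port: the grid width, the helper, and all the names are hypothetical, picked only to show a per-iteration call-and-multiply being replaced by running additions.

```python
WIDTH = 32  # hypothetical row width, chosen only for illustration


def cell_offset(row, col):
    """Stand-in for a helper procedure: definitely correct, but one call per use."""
    return row * WIDTH + col


def offsets_with_calls(rows, cols):
    """Straightforward version: one call and one multiply per cell."""
    return [cell_offset(r, c) for r in range(rows) for c in range(cols)]


def offsets_with_induction(rows, cols):
    """Same offsets, with the call and multiply replaced by running additions."""
    out = []
    row_start = 0               # tracks row * WIDTH without multiplying
    for _ in range(rows):
        offset = row_start      # tracks row * WIDTH + col
        for _ in range(cols):
            out.append(offset)
            offset += 1
        row_start += WIDTH
    return out


# Both versions walk the grid in the same order and agree on every offset.
assert offsets_with_calls(4, 8) == offsets_with_induction(4, 8)
```

In assembly terms the running values would live in registers, which is where having 15 of them instead of 3 pays off.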

There's apparently an old joke about how the Genesis design is cleaner but more primitive, with a vastly overpowered CPU, compared to the SNES and its wild array of three-quarters-baked bespoke hardware. The SNES was "a Lamborghini with a lawnmower engine", and the Sega was "a lawnmower with a Lamborghini engine."

It's a good line, but I think there may be less justice to it than my first impressions of the platforms suggested.

/end. Thanks for reading to the end, if you made it this far!