GDC 2016: "Taming the Jaguar: x86 Optimization at Insomniac Games" by Andreas Fredriksson (@deplinenoise) of Insomniac Games https://gdcvault.com/play/1023340/Taming-the-Jaguar-x86-Optimization

I thought this was really good, because it discussed CPU microarchitecture in an understandable way.

The first section was about the frontend of the chip, which is the part that fetches and decodes instructions. The Jaguar can only fetch and decode 2 instructions per cycle, which can actually be a bottleneck if you're doing a bunch of quick math on registers.

1/6

Later in the presentation, the presenter gives an example where adding a prefetch instruction actually slows the program down, just because it was bottlenecked on instruction fetch, and the additional prefetch instruction made that part worse.
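To make that concrete, here's a rough sketch of what such a loop could look like. This is not the code from the talk: the function names and `PREFETCH_DISTANCE` are made up for illustration, and `__builtin_prefetch` is the GCC/Clang builtin, so other compilers would need a different intrinsic. The point is only that the second version computes the same result while pushing more instructions through the frontend on every iteration.

```c
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead to prefetch. */
enum { PREFETCH_DISTANCE = 16 };

/* Baseline loop: one load and one add per element. */
float sum_plain(const float *data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}

/* Same result, but every iteration also carries a bounds check and a
 * prefetch. Those are extra instructions that have to be fetched and
 * decoded, so if the loop is already limited by the frontend, this
 * version can end up slower even though it "helps" the cache. */
float sum_with_prefetch(const float *data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        if (i + PREFETCH_DISTANCE < count)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE]);
        sum += data[i];
    }
    return sum;
}
```

Whether the prefetch version wins or loses depends on the data and the chip, so it needs to be measured rather than assumed.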

The next section was about instruction retirement. This is the part that I thought was most interesting, because I don't know much about it. Instructions have to appear to complete (or "retire") in program order.

2/6

So the whole chip is fetching instructions ahead of time and speculatively executing them, and then when the results are computed, they enter the retire unit, where they hang around until they are allowed to "retire" observably in program order.

The interesting thing here is that the retire unit only holds 64 entries, but fetching from memory takes 200 cycles at a minimum. That means we can't (even nearly!) hide memory latency by running ALU work concurrently.
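Here's the back-of-the-envelope version of that, using only the numbers from the thread (64 entries, 2 instructions per cycle, 200 cycles for a miss). It's a simplified model, not a simulator: it assumes the frontend keeps delivering at full rate, that each instruction takes exactly one retire entry, and that the missing load is the oldest entry. The function names are mine.

```c
#include <assert.h>

/* Cycles until the retire queue is full while its oldest entry is a
 * load that is still waiting on memory. */
unsigned cycles_until_retire_queue_full(unsigned entries, unsigned issue_width)
{
    assert(issue_width > 0);
    return entries / issue_width;
}

/* Cycles of the miss that are NOT covered by other in-flight work:
 * once the queue is full, nothing new can enter until the load retires. */
unsigned exposed_miss_cycles(unsigned miss_latency, unsigned entries,
                             unsigned issue_width)
{
    unsigned covered = cycles_until_retire_queue_full(entries, issue_width);
    return miss_latency > covered ? miss_latency - covered : 0;
}
```

With the thread's numbers, `cycles_until_retire_queue_full(64, 2)` is 32 and `exposed_miss_cycles(200, 64, 2)` is 168, so under this model most of the miss is spent with the chip stalled.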

3/6

So, if you're going to hit memory, you'll want to manually schedule either a) expensive instructions, or b) other memory operations, concurrently, to maximize parallelism of the chip. So interesting!!!
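Option b) might look something like this sketch, which is my own illustration and not from the talk. Walking a linked list is a chain of dependent loads, so only one miss can be in flight at a time. Walking two independent lists in the same loop puts two independent misses close together in the instruction stream, so they can overlap. The `struct node` layout and function names are hypothetical.

```c
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* One list: each n->next load depends on the previous one, so cache
 * misses are served strictly one after another. */
long sum_list(const struct node *n)
{
    long sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Two lists interleaved: the loads for `a` and `b` do not depend on
 * each other, so both misses can be outstanding at the same time. */
long sum_two_lists(const struct node *a, const struct node *b)
{
    long sum = 0;
    while (a != NULL && b != NULL) {
        sum += a->value;
        sum += b->value;
        a = a->next;
        b = b->next;
    }
    /* At most one of these still has nodes left. */
    sum += sum_list(a);
    sum += sum_list(b);
    return sum;
}
```

Calling `sum_list` twice gives the same answer; the interleaved version only pays off when the nodes are actually missing cache, which is something to measure.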

The previous diagram also lists the number of rename registers, which is fewer than I expected (72 and 64); I expected there would be hundreds. Interesting!

Another really interesting part: it turns out that interacting with the cache is done in transactions.

4/6

@GDCPresoReviews Jaguar is a tiny and very slow core by modern standards. "hundreds" is correct for modern cores, e.g. here's Zen 4 diagram.
@zeux 224 and 192, right
@GDCPresoReviews @zeux FWIW even Zen 4 at 320-entry retire queue is quite moderate by current big-core standards, Apple M1 is 600+ instructions in retire queue https://dougallj.github.io/applecpu/firestorm.html, Intel Skymont (which is an "E-core", i.e. the smaller ones!) is >400 https://chipsandcheese.com/p/skymont-intels-e-cores-reach-for-the-sky