GDC 2016: "Taming the Jaguar: x86 Optimization at Insomniac Games" by Andreas Fredriksson (@deplinenoise) of Insomniac Games https://gdcvault.com/play/1023340/Taming-the-Jaguar-x86-Optimization

I thought this was really good, because it discussed CPU microarchitecture in an understandable way.

The first section was about the frontend of the chip, which is the part that fetches and decodes instructions. The Jaguar can only decode 2 instructions per cycle, which can actually be a bottleneck if you're doing a bunch of quick math on registers.

1/6


Later in the presentation, the presenter gives an example where adding a prefetch instruction actually slows the program down, just because it was bottlenecked on instruction fetch, and the additional prefetch instruction made that part worse.
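For reference, a manual software prefetch in C usually looks something like this. This is my own sketch, not the talk's code: it uses the GCC/Clang `__builtin_prefetch` builtin, and the lookahead distance of 16 elements is an arbitrary illustration. The talk's point is that in a loop this simple, the extra instruction can itself eat a fetch/decode slot:

```c
#include <stddef.h>

/* Sum an array, issuing a software prefetch a few elements ahead.
 * __builtin_prefetch is a hint only; it never changes results.
 * In a fetch-bound loop like this, the extra instruction can make
 * things slower, which is exactly the talk's cautionary example. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* GCC/Clang builtin */
        s += a[i];
    }
    return s;
}
```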

The next section was about instruction retirement. This is the part that I thought was most interesting, because I don't know much about it. Instructions have to appear to complete (or "retire") in program order.

2/6

So the whole chip is fetching instructions ahead and speculatively executing them, and then when the results are computed, they enter the retire unit, where they hang around until they're allowed to "retire" observably in program order.

The interesting thing here is that the retire unit only holds 64 entries, but fetching from memory takes 200 cycles at a minimum. That means we can't (even nearly!) hide memory latency just by running ALU work concurrently.

3/6

So, if you're going to hit memory, you'll want to manually schedule either a) expensive instructions, or b) other memory operations, concurrently, to maximize parallelism of the chip. So interesting!!!
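A sketch of what "other memory operations concurrently" can mean in practice (my example, not from the talk): walking two independent linked lists in lockstep, so the two cache misses per iteration can overlap inside the out-of-order window instead of running the full pointer chases back to back.

```c
#include <stddef.h>

struct node { long value; struct node *next; };

/* Walk two independent lists in lockstep. The two loads in the loop
 * body don't depend on each other, so their misses can be in flight
 * at the same time. A sketch of the idea, not measured code. */
long sum_two_lists(const struct node *a, const struct node *b) {
    long s = 0;
    while (a && b) {
        s += a->value + b->value;  /* independent loads */
        a = a->next;
        b = b->next;
    }
    while (a) { s += a->value; a = a->next; }  /* leftover of either list */
    while (b) { s += b->value; b = b->next; }
    return s;
}
```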

The previous diagram also lists the number of rename registers, which is fewer than I expected (72 and 64); I expected there would be hundreds. Interesting!

Another really interesting part: it turns out that interacting with the cache is done in transactions.

4/6

Each cycle can only issue a single one of these cache transactions. This is why cached SIMD loads are faster than cached scalar loads: the bottleneck isn't actually the throughput of the cache, but is instead the setup time of the cache transactions. You get 4x more data out of the cache per cycle than you would if you issued scalar loads. Wild!!!
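To make the 4x point concrete, here's my own sketch (not the talk's code): four scalar 4-byte loads versus one 16-byte load. I'm standing in for the SIMD register load (e.g. `_mm_load_si128` on x86) with a `memcpy`, which compilers typically compile to a single wide load; the results are identical, but the wide version touches the cache once instead of four times.

```c
#include <stdint.h>
#include <string.h>

/* Scalar: four 4-byte loads, i.e. four separate cache transactions. */
uint32_t sum4_scalar(const uint32_t *p) {
    return p[0] + p[1] + p[2] + p[3];
}

/* "Wide": one 16-byte copy standing in for a SIMD load; compilers
 * typically emit a single 16-byte load here, i.e. one transaction
 * for the same data. */
uint32_t sum4_wide(const uint32_t *p) {
    uint32_t v[4];
    memcpy(v, p, sizeof v);
    return v[0] + v[1] + v[2] + v[3];
}
```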

The rest of the presentation was working through examples of some common algorithms on some common data structures.

5/6

The main takeaways are mostly:
- Linked lists defeat the out of order capabilities of the processor, and there's basically nothing you can do about it
- The hardware is already optimized for iterating through arrays, so there's not really anything you can do to speed that up
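The difference between the two cases in one sketch (mine, not from the talk): in the list, each load's address comes from the previous load's result, so nothing can start early; in the array, every address is computable up front.

```c
#include <stddef.h>

struct lnode { long value; struct lnode *next; };

/* Linked list: a serial dependency chain. The next address isn't
 * known until the current miss resolves, so the OoO core can't
 * overlap the misses, and there's little you can do about it. */
long sum_list(const struct lnode *n) {
    long s = 0;
    for (; n; n = n->next)
        s += n->value;
    return s;
}

/* Array: addresses are independent of the loaded data, so the
 * hardware prefetcher and the OoO window keep many loads in
 * flight; the hardware already does the work for you. */
long sum_array(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```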

Review: 10/10 I learned a lot!!! The specific numbers aren't really relevant nowadays, but the approach he used to compare the throughput of different units, and how to measure it, was really illustrative.

@GDCPresoReviews Jaguar is a tiny and very slow core by modern standards. "hundreds" is correct for modern cores, e.g. here's Zen 4 diagram.
@zeux 224 and 192, right
@GDCPresoReviews @zeux FWIW even Zen 4 at 320-entry retire queue is quite moderate by current big-core standards, Apple M1 is 600+ instructions in retire queue https://dougallj.github.io/applecpu/firestorm.html, Intel Skymont (which is an "E-core", i.e. the smaller ones!) is >400 https://chipsandcheese.com/p/skymont-intels-e-cores-reach-for-the-sky