GDC 2016: "Taming the Jaguar: x86 Optimization at Insomniac Games" by Andreas Fredriksson (@deplinenoise) of Insomniac Games https://gdcvault.com/play/1023340/Taming-the-Jaguar-x86-Optimization

I thought this was really good, because it discussed CPU microarchitecture in an understandable way.

The first section was about the frontend of the chip, which is the part that fetches and decodes instructions. The Jaguar can only decode 2 instructions per cycle, which can actually be a bottleneck if you're doing a bunch of quick math on registers.

1/6


Later in the presentation, the presenter gives an example where adding a prefetch instruction actually slows the program down, just because it was bottlenecked on instruction fetch, and the additional prefetch instruction made that part worse.
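For reference, a manual software prefetch in C usually looks something like this. This is my own sketch, not the talk's code: it uses the GCC/Clang `__builtin_prefetch` builtin, and the lookahead distance of 16 elements is an arbitrary illustration. The talk's point is that in a loop this simple, the extra instruction can itself eat a fetch/decode slot:

```c
#include <stddef.h>

/* Sum an array, issuing a software prefetch a few elements ahead.
 * __builtin_prefetch is a hint only; it never changes results.
 * In a fetch-bound loop like this, the extra instruction can make
 * things slower, which is exactly the talk's cautionary example. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* GCC/Clang builtin */
        s += a[i];
    }
    return s;
}
```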

The next section was about instruction retirement. This is the part that I thought was most interesting, because I don't know much about it. Instructions have to appear to complete (or "retire") in program order.

2/6

So the whole chip is fetching instructions ahead and speculatively executing them, and then when the results are computed, they enter the retire unit, where they hang around until they're allowed to "retire" observably in program order.

The interesting thing here is that the retire unit only holds 64 entries, but fetching from memory takes 200 cycles at a minimum. That means we can't (even nearly!) hide memory latency just by running ALU work concurrently.

3/6

So, if you're going to hit memory, you'll want to manually schedule either a) expensive instructions, or b) other memory operations, concurrently, to maximize parallelism of the chip. So interesting!!!
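A sketch of what "other memory operations concurrently" can mean in practice (my example, not from the talk): walking two independent linked lists in lockstep, so the two cache misses per iteration can overlap inside the out-of-order window instead of running the full pointer chases back to back.

```c
#include <stddef.h>

struct node { long value; struct node *next; };

/* Walk two independent lists in lockstep. The two loads in the loop
 * body don't depend on each other, so their misses can be in flight
 * at the same time. A sketch of the idea, not measured code. */
long sum_two_lists(const struct node *a, const struct node *b) {
    long s = 0;
    while (a && b) {
        s += a->value + b->value;  /* independent loads */
        a = a->next;
        b = b->next;
    }
    while (a) { s += a->value; a = a->next; }  /* leftover of either list */
    while (b) { s += b->value; b = b->next; }
    return s;
}
```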

The previous diagram also lists the number of rename registers, which is fewer than I expected (72 and 64); I expected there would be hundreds. Interesting!

Another really interesting part: it turns out that interacting with the cache is done in transactions.

4/6

Each cycle can only issue a single one of these cache transactions. This is why cached SIMD loads are faster than cached scalar loads: the bottleneck isn't actually the throughput of the cache, but is instead the setup time of the cache transactions. You get 4x more data out of the cache per cycle than you would if you issued scalar loads. Wild!!!
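To make the 4x point concrete, here's my own sketch (not the talk's code): four scalar 4-byte loads versus one 16-byte load. I'm standing in for the SIMD register load (e.g. `_mm_load_si128` on x86) with a `memcpy`, which compilers typically compile to a single wide load; the results are identical, but the wide version touches the cache once instead of four times.

```c
#include <stdint.h>
#include <string.h>

/* Scalar: four 4-byte loads, i.e. four separate cache transactions. */
uint32_t sum4_scalar(const uint32_t *p) {
    return p[0] + p[1] + p[2] + p[3];
}

/* "Wide": one 16-byte copy standing in for a SIMD load; compilers
 * typically emit a single 16-byte load here, i.e. one transaction
 * for the same data. */
uint32_t sum4_wide(const uint32_t *p) {
    uint32_t v[4];
    memcpy(v, p, sizeof v);
    return v[0] + v[1] + v[2] + v[3];
}
```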

The rest of the presentation was working through examples of some common algorithms on some common data structures.

5/6

The main takeaways are mostly:
- Linked lists defeat the out of order capabilities of the processor, and there's basically nothing you can do about it
- The hardware is already optimized for iterating through arrays, so there's not really anything you can do to speed that up
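The difference between the two cases in one sketch (mine, not from the talk): in the list, each load's address comes from the previous load's result, so nothing can start early; in the array, every address is computable up front.

```c
#include <stddef.h>

struct lnode { long value; struct lnode *next; };

/* Linked list: a serial dependency chain. The next address isn't
 * known until the current miss resolves, so the OoO core can't
 * overlap the misses, and there's little you can do about it. */
long sum_list(const struct lnode *n) {
    long s = 0;
    for (; n; n = n->next)
        s += n->value;
    return s;
}

/* Array: addresses are independent of the loaded data, so the
 * hardware prefetcher and the OoO window keep many loads in
 * flight; the hardware already does the work for you. */
long sum_array(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```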

Review: 10/10 I learned a lot!!! The specific numbers aren't really relevant nowadays, but the approach he used to compare the throughput of different units, and how to measure it, was really illustrative.

@GDCPresoReviews Jaguar is a tiny and very slow core by modern standards. "hundreds" is correct for modern cores, e.g. here's Zen 4 diagram.
@zeux 224 and 192, right
@GDCPresoReviews @zeux FWIW even Zen 4 at 320-entry retire queue is quite moderate by current big-core standards, Apple M1 is 600+ instructions in retire queue https://dougallj.github.io/applecpu/firestorm.html, Intel Skymont (which is an "E-core", i.e. the smaller ones!) is >400 https://chipsandcheese.com/p/skymont-intels-e-cores-reach-for-the-sky