Hello you fine Internet folks,

Today's article looks at Kryo, Qualcomm's last custom core, covering the BPU, ROB, and other core structures, and compares it to the contemporary Cortex-A72.

Hope y'all enjoy!

https://chipsandcheese.com/2023/07/12/kryo-qualcomms-last-in-house-mobile-core/

Kryo: Qualcomm’s Last In-House Mobile Core

CPU design is hard. You can tell because there aren’t a lot of companies doing it. AMD and Intel are your only choices in the PC scene. In the Android ecosystem, ARM Ltd’s cores dominat…

Chips and Cheese

@chipsandcheese Cortex-A72 FMA latency is 7 cycles but with a huge asterisk:

ARM's design has a multiplier computing the unrounded product for the first 4 cycles, and then does the addition in the last 3 cycles. Crucially, the addition dependency is tracked separately, so accumulations onto the same register issue at a rate of one FMA every 3 cycles, not one every 7 (or 5).

For practical purposes (e.g. dot products, matrix ops, etc.), it really behaves more like a 3-cycle latency than a 7-cycle one.
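
To make the arithmetic concrete, here's a rough toy model of a single accumulation chain (every FMA adds onto the same register). It assumes the 4-cycle multiply / 3-cycle add split described above, that the multiplicands are always ready, and that nothing else stalls the pipe. The function names are made up for illustration; this is not a model of the real scheduler.

```python
def issue_times(n, mul_cycles=4, add_cycles=3, late_forwarding=True):
    """Issue cycle of each of n FMAs that accumulate onto one register.

    Toy model: an FMA issued at cycle t has its result ready at
    t + mul_cycles + add_cycles. With late forwarding, the next FMA only
    needs the accumulator when its own add stage starts, i.e. mul_cycles
    after it issues, so it can issue that much earlier.
    """
    total = mul_cycles + add_cycles
    times, t = [], 0
    for _ in range(n):
        times.append(t)
        result_ready = t + total
        # Assumption: without late forwarding, the accumulator is needed at issue.
        t = result_ready - mul_cycles if late_forwarding else result_ready
    return times


def chain_cycles(n, **kwargs):
    """Cycle at which the last FMA of the chain completes (0 for an empty chain)."""
    if n <= 0:
        return 0
    total = kwargs.get("mul_cycles", 4) + kwargs.get("add_cycles", 3)
    return issue_times(n, **kwargs)[-1] + total


print(issue_times(4))                         # issue cycles with late forwarding
print(issue_times(4, late_forwarding=False))  # issue cycles if the full 7 were exposed
print(chain_cycles(8), chain_cycles(8, late_forwarding=False))
```

In this model a chain of n accumulating FMAs takes 3(n-1) + 7 cycles rather than 7n, so the 7-cycle figure only shows up once, at the end of the chain.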

@chipsandcheese ARM's designs have been using this high-level approach for a long time (see e.g. http://www.acsel-lab.com/arithmetic/arith20/papers/ARITH20_Lutz.pdf from 2011), and with good reason. It's a very sensible trade-off, and it also avoids performance anomalies where critical paths get longer when compilers aggressively fuse into FMAs, since in the ARM designs an FMA never has higher latency than a dependent FMUL->FADD.