Mastodawn

R2DHue Apr 6, 2025

@dougall
I would like to amend my above post to say it’s stupid and should be ignored.

There *is no* “straight execution path” to AMX hardware because, despite its apparent physical separation from the CPU cluster in die shots, Apple considers AMX *part* of its CPUs — like an FPU or NEON hardware.

AMX is a “slave” to the CPU because it *is* the CPU for all intents and purposes.

And if Apple fully embraces SVE1/2, AMX will be superannuated and may stick around purely for legacy compatibility.

R2DHue Apr 6, 2025

@wadiest @atpfm

Apple doesn’t consider AMX a “discrete” coprocessor or processor like its GPUs or NPU but considers AMX part of the CPU, which is confusing considering people have identified a distinct “block” with 4 specialized ALUs appearing apart from the CPU cluster on the floorplan.

Nevertheless, AMX is CPU-controlled logic that’s like a CPU’s FPU or NEON hw.

It’ll be interesting to see if AMX becomes obsolete if Apple fully embraces ARM’s SME/2 in ARMv9.# in future SoCs, e.g. M5…

R2DHue Mar 24, 2025

@dougall So, since your famous (infamous) reverse engineering and analysis of Apple’s AMX for the M1 & A14, have you discovered any hardware improvements to this unit up through the M4 & A18?

“Multicore” AMX would be too wasteful/expensive in die area/power/thermal/complexity respects, but how about notable architectural improvements to its internal logic units: wider paths, more lanes, on-chip SRAM or DMA (things that cannot be accounted for by die process shrink and higher clock speeds alone)?

R2DHue Mar 24, 2025

@donni

What are you talking about ⁉️

Skeletons have to do regular white loads of themselves so as to maintain their sparkling white sheen and springtime-y fresh scent!

R2DHue Mar 22, 2025

@dougall
Learned some things down the rabbit hole.

Despite what LLVM says, M4 is somewhere between ARMv9.2-A and 9.4-A compliant. (SSV not required.)

SME/SME2 *DOES* bring matrix math HARDWARE enhancements to ARM CPUs. Apple’s now using some SME or even SME2, and a FEW matrix ops actually require a *touch* of SVE(!); Apple may eliminate its AMX coprocessor altogether in future SoCs and go all-in with ARM SME/2! (Stick with Accelerate.)

Apple will still have custom GPU & NPU designs in its “moat.”
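
For anyone wondering what the “matrix math HARDWARE” actually does: as far as I understand it, SME's core matrix instructions are outer-product-accumulate ops (FMOPA and friends) that add a rank-1 update into an accumulator tile. Here's a minimal pure-Python sketch of that idea only. The function name `outer_product_matmul` and the `tile` variable are mine for illustration, and this models the math, not Apple's or ARM's actual implementation.

```python
def outer_product_matmul(a, b):
    """Multiply a (m x k) by b (k x n) as a sum of k rank-1 outer products."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")

    # Accumulator tile, standing in for the hardware's tile storage.
    tile = [[0.0] * n for _ in range(m)]

    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of a
        row = b[p]                         # p-th row of b
        # One outer-product-accumulate step: tile += col (x) row
        for i in range(m):
            for j in range(n):
                tile[i][j] += col[i] * row[j]
    return tile


if __name__ == "__main__":
    print(outer_product_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each pass of the outer loop touches every element of the tile once, which is roughly why this formulation maps well onto wide matrix hardware.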

R2DHue Mar 21, 2025

@dougall

—BUT—through Accelerate, matrix will be executed on Apple’s AMX(s).*

*or perhaps on the NPU for lower precision or the GPU, as Accelerate “efficiently” determines (presumably)

R2DHue Mar 21, 2025

@dougall

re: “The weird bit, AMX is still present on the M4, along with SME”

My understanding is that ARMv9 SME/SME2 are architecturally defined for the »CPU« so these “matrix” extensions execute ONLY on the CPU!

Apple could have additional proprietary matrix instructions that execute instead on the AMX blocks.
This suggests Apple SHOULDN’T encourage devs to use SME/SME2 because those can only be done on the CPU (not even the NPU!). BUT, through Accelerate, matrix will be executed on Apple’s AMX(s).
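
If you want to check what a given Mac reports about SME, here's a rough sketch that asks `sysctl`. I'm assuming the keys are named `hw.optional.arm.FEAT_SME` and `hw.optional.arm.FEAT_SME2` on recent macOS; treat those names as an assumption to verify on your own machine. The helper names `parse_flag` and `macos_cpu_feature` are made up for this example. It returns `None` anywhere the key or the `sysctl` binary isn't available.

```python
import subprocess


def parse_flag(text):
    """Map sysctl output to True/False, or None if it is not a 0/1 flag."""
    text = text.strip()
    if text == "1":
        return True
    if text == "0":
        return False
    return None


def macos_cpu_feature(name):
    """Query hw.optional.arm.<name> (assumed key layout); None if unknown."""
    key = "hw.optional.arm." + name
    try:
        result = subprocess.run(
            ["sysctl", "-n", key],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # No sysctl binary, or the key does not exist on this system.
        return None
    return parse_flag(result.stdout)


if __name__ == "__main__":
    for feature in ("FEAT_SME", "FEAT_SME2"):
        print(feature, macos_cpu_feature(feature))
```

This only tells you what the OS advertises for the architectural extension; it says nothing about AMX, which Apple doesn't document.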

R2DHue Mar 20, 2025

@dougall

btw, “E cores having smaller, slower, efficient AMX coprocessors” sounds strange to me—but what do I know…

I suppose a speed mismatch would be a problem.

My understanding is ARMv9(.2-A)’s 70 or so SME/SME2 instructions happen on the CPU like any other instruction; Apple’s undocumented AMX instructions on the other hand happen on its dedicated Matrix coprocessor—a block stock ARM Cortex SoCs lack.

It may be an unpopular opinion, but I think Apple’s right to steer devs to Accelerate…

R2DHue Mar 20, 2025

thx, @dougall
1/2
Everything about a “discrete matrix multiplication unit” remains murky, for Apple’s proprietary AMX block or anyone else’s for that matter, afaik.

There’s no stock “ARM ‘Cortex’ Matrix Unit” reference design; every licensee bakes its own. ARM doesn’t have an off-the-shelf one like it does for stock CPUs/GPUs.

9.2-A’s SME is only that: a SOFTWARE instruction set. And Scalable Matrix Extension calls are executed on ARM CPUs only.

Apple AMX instructions, otoh, have a straight execution path—no?

R2DHue Mar 20, 2025

@dougall
2/2
This, along with SVE2, Realm/CCA, MTE, BTI2… is I think ARM trying to beef up its CPU (albeit at the expense of RISC) in hopes of maintaining the relevance of its venerable CPUs in the face of increasing threats from NVIDIA GPUs, Google TPUs, NPUs, custom silicon from Amazon (even Meta) and others, esp. in the AI/HPC space.

But is it Apple’s job to keep ARM CPUs relevant vis-à-vis competitors’ non-CPU processors, or is that ARM’s problem? 🤔 Food for thought anyway…