@R2DHue

Unapologetic Apple fanboi since my age was in single digits. 

@dougall
I would like to amend my above post to say it’s stupid and should be ignored.

There *is no* “straight execution path” to AMX hardware because, despite its apparent physical separation from the CPU cluster in die shots, Apple considers AMX *part* of its CPUs — like an FPU or NEON hardware.

AMX is a “slave” to the CPU because it *is* the CPU for all intents and purposes.

And if Apple fully embraces SVE1/2, AMX will be superannuated and may stick around purely for legacy compatibility.

@dougall So, since your famous (infamous) reverse engineering and analysis of Apple’s AMX for the M1 & A14, have you discovered any hardware improvements to this unit up through the M4 & A18?

“Multicore” AMX would be too wasteful/expensive in die area/power/thermal/complexity respects, but how about notable architectural improvements to its internal logic units: wider paths, more lanes, on-chip SRAM or DMA — things that cannot be accounted for by die process shrink and higher clock speeds alone(?)

@dougall
Learned some things down the rabbit hole.

Despite what LLVM says, M4 is somewhere between ARMv9.2-A and 9.4-A compliant. (SSV not required.)

SME/SME2 *DOES* bring matrix math HARDWARE enhancements to ARM CPUs. Apple’s now using some SME or even SME2, and a FEW matrix ops actually require a *touch* of SVE(!); Apple may eliminate its AMX coprocessor altogether in future SoCs and go all-in with ARM SME/2! (Stick with Accelerate.)

Apple will still have custom GPU & NPU designs in its “moat.”

@dougall

—BUT—through Accelerate, matrix will be executed on Apple’s AMX(s).*

*or perhaps on the NPU for lower precision or the GPU, as Accelerate “efficiently” determines (presumably)

@dougall

re: “The weird bit, AMX is still present on the M4, along with SME”

My understanding is that ARMv9 SME/SME2 are architecturally defined for the »CPU« so these “matrix” extensions execute ONLY on the CPU!

Apple could have additional proprietary matrix instructions that execute instead on the AMX blocks.
This suggests Apple SHOULDN’T encourage devs to use SME/SME2 because those can only be done on the CPU (not even the NPU!)—BUT—through Accelerate, matrix will be executed on Apple’s AMX(s).
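
For anyone who wants to check what their own machine reports rather than take my word for it, here is a minimal sketch. It assumes macOS exposes SME support under the `hw.optional.arm.FEAT_SME` and `hw.optional.arm.FEAT_SME2` sysctl keys (compare with the output of `sysctl hw.optional.arm` on your system). The helper names are mine, and a `None` result only means "could not be read", not "absent."

```python
import subprocess
from typing import Dict, Optional

# Key names are an assumption based on what `sysctl hw.optional.arm`
# prints on recent macOS; verify them on your own machine.
SME_KEYS = ("hw.optional.arm.FEAT_SME", "hw.optional.arm.FEAT_SME2")


def read_sysctl_flag(key: str) -> Optional[bool]:
    """Return True/False for a 0/1 sysctl key, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", key],
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # no sysctl binary, or the call failed or timed out
    value = result.stdout.strip()
    if result.returncode != 0 or value not in ("0", "1"):
        return None  # key unknown on this OS or hardware
    return value == "1"


def sme_report() -> Dict[str, Optional[bool]]:
    """Map each SME-related key to True, False, or None (unknown)."""
    return {key: read_sysctl_flag(key) for key in SME_KEYS}


if __name__ == "__main__":
    for key, present in sme_report().items():
        print(f"{key}: {present}")
```

Note this only tells you what the OS advertises for the CPU. It says nothing about which block of silicon actually does the work.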

@dougall

btw, “E cores having smaller, slower, efficient AMX coprocessors” sounds strange to me—but what do I know…

I suppose a speed mismatch would be a problem.

My understanding is ARMv9(.2-A)’s 70 or so SME/SME2 instructions happen on the CPU like any other instruction; Apple’s undocumented AMX instructions on the other hand happen on its dedicated Matrix coprocessor—a block stock ARM Cortex SoCs lack.

It may be an unpopular opinion, but I think Apple’s right to steer devs to Accelerate…

thx, @dougall
1/2
Everything “discrete matrix multiplication unit” remains murky for Apple’s proprietary AMX block or anyone else’s for that matter—afaik.

There’s no stock “ARM ‘Cortex’ Matrix Unit” reference design; every licensee bakes its own—ARM doesn’t have an off-the-shelf one like it does stock CPU/GPUs.

9.2-A’s SME is only that—a SOFTWARE instruction set. And Scalable Matrix Extension calls are executed on ARM CPUs—only

Apple AMX instructions, otoh, have a straight execution path—no?

@dougall
2/2
This, along with SVE2, Realm/CCA, MTE, BTI2… is I think ARM trying to beef up its CPU—albeit at the expense of RISC—in hopes of maintaining the relevance of its venerable CPUs in the face of increasing threats from NVIDIA GPUs, Google TPUs, NPUs, custom silicon from Amazon—even Meta—and others esp. in the AI/HPC space.

But is it Apple’s job to keep ARM CPUs relevant vis-à-vis competition by non-CPU processors by competitors, or is that ARM’s problem? 🤔 Food for thought anyway…

@dougall

Have any new details come out about Apple’s matrix coprocessor? Is it in “cores” like the CPU, NPU, GPU? (e.g. Apple could beef up its matrix coprocessor just by adding more cores in future SoCs.)

Is there 1 matrix coprocessor per SoC or multiple matrix coprocessors per SoC?

(One would have to assume a minimum of 2 for Apple’s M1, M2 & M3 “Ultra” SoCs.)

Anyone know its raw performance in relation to other companies’ implementations of a discrete matrix math block?

Is NEON+AMX > SVE2?

Devs moaned and complained re: Apple’s “opaqueness” about its proprietary matrix coprocessor plus its injunction to just write via “Accelerate” and let the framework divvy up the tasks as it sees fit.
Everyone wanted to program the coprocessor directly.
Well whattayaknow!? Looky-looky. Apple was right (again).
With ’s adoption of ARM’s SME software instruction set from ARMv9-A, as of the M4 and A18, any code that writes directly to Apple’s matrix coprocessor will likely break on those machines.