Mastodawn

R2DHue Mar 24, 2025

@dougall So, since your famous (infamous) reverse engineering and analysis of Apple’s AMX for the M1 & A14, have you discovered any hardware improvements to this unit up through the M4 & A18?

“Multicore” AMX would be too wasteful/expensive in terms of die area, power, thermals and complexity, but how about notable architectural improvements to its internal logic units: wider paths, more lanes, on-chip SRAM or DMA (things that cannot be accounted for by die process shrink and higher clock speeds alone)?

R2DHue Mar 17, 2025

@dougall

Have any new details come out about Apple’s matrix coprocessor? Is it in “cores” like the CPU, NPU, GPU? (e.g. Apple could beef up its matrix coprocessor just by adding more cores in future SoCs.)

Is there 1 matrix coprocessor per SoC or multiple matrix coprocessors per SoC?

(One would have to assume a minimum of 2 for Apple’s M1, M2 & M3 “Ultra” SoCs.)

Anyone know its raw performance in relation to other companies’ implementations of a discrete matrix math block?

Is NEON+AMX > SVE2?

R2DHue Mar 17, 2025

Devs moaned and complained re: Apple’s “opaqueness” about its proprietary matrix coprocessor plus its injunction to just write via “Accelerate” and let the framework divvy up the tasks as it sees fit.
Everyone wanted to program the coprocessor directly.
Well whattayaknow!? Looky-looky. Apple was right (again).
With Apple’s adoption of ARM’s SME software instruction set from ARMv9-A, as of the M4 and A18, any code that writes directly to Apple’s matrix coprocessor will likely break on those machines.