Calling it! Between the SDK, Xcode, Geekbench flag detection & more, Apple has officially brought the first Arm Scalable Matrix Extension (SME) enabled device to market in the new #M4. With that also comes the move from NEON/ASIMD to the Streaming Scalable Vector Extension (SSVE).
#SME #SSVE

What does this mean? It means that we now have a dedicated matrix ASIC that can be driven via standard, architecturally defined opcodes, available to anyone with a relevant toolchain and compiler.
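
A minimal detection sketch for Apple platforms, assuming Apple follows its existing hw.optional.arm.FEAT_* sysctl convention for SME (the exact key name here is my assumption, not confirmed):

```c
#include <stdio.h>
#include <sys/sysctl.h>

int main(void) {
    int has_sme = 0;
    size_t len = sizeof(has_sme);
    // Assumed key, following Apple's existing hw.optional.arm.FEAT_*
    // naming convention; the call fails cleanly if the key doesn't exist.
    if (sysctlbyname("hw.optional.arm.FEAT_SME", &has_sme, &len, NULL, 0) == 0
        && has_sme) {
        printf("SME supported\n");
    } else {
        printf("SME not reported by this kernel\n");
    }
    return 0;
}
```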

For the most part, expect all of your #BLAS kernels to gain SME support over time!
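
The nice part is that existing BLAS callers shouldn't need to change at all. A plain CBLAS DGEMM through Apple's Accelerate framework, like the sketch below, is where any dispatch to the matrix unit would happen under the hood (Accelerate routing this through SME on M4 is my assumption, not something Apple documents):

```c
// build: clang dgemm.c -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    // C = alpha * A * B + beta * C, all FP64, row-major 2x2
    double A[] = {1, 2, 3, 4};
    double B[] = {5, 6, 7, 8};
    double C[] = {0, 0, 0, 0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    printf("%g %g / %g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```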

For #HPC, in contrast with most matrix-tile implementations, we have spec-mandated single and double precision support.

That's in contrast with the x86 AMX extensions, most consumer dGPU implementations, etc., which top out at 19-bit formats (think TF32) and below.
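
For a feel of what that FP64 mandate looks like at the instruction level, here's a minimal sketch of one SME FP64 outer-product-accumulate step, using names from the Arm C Language Extensions for SME. This assumes a bleeding-edge clang with SME ACLE support plus the FEAT_SME_F64F64 extension, which I'm assuming the implementation carries given the HPC pitch:

```c
#include <arm_sme.h>

// ZA tile 0 += a (column vector) x b (row vector), accumulating in
// full FP64. One FMOPA retires an (SVL/64) x (SVL/64) tile of FMAs.
void outer_product_step(const double *a, const double *b)
    __arm_streaming __arm_inout("za")
{
    svbool_t pg = svptrue_b64();
    svfloat64_t za = svld1_f64(pg, a);
    svfloat64_t zb = svld1_f64(pg, b);
    svmopa_za64_f64_m(0, pg, pg, za, zb);
}
```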

If the implementation is anything like previous Apple matrix co-processor designs, I expect TFLOPS-class dense DGEMM capabilities while drawing sub-20 W and pinning only 1 core per sub-cluster (those are all very, very good things).
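
Back-of-envelope, with numbers that are entirely my assumptions (512-bit SVL, one FP64 FMOPA retired per cycle per unit, ~2 GHz), the per-unit peak works out like this; scale by however many SME units the SoC actually has:

```c
#include <stdio.h>

int main(void) {
    // All of these are assumptions for illustration, not measurements:
    double svl_bits  = 512.0;                     // guessed streaming VL
    double tile_dim  = svl_bits / 64.0;           // 8x8 FP64 ZA tile per FMOPA
    double flops     = tile_dim * tile_dim * 2.0; // FMA = 2 FLOPs -> 128
    double clock_ghz = 2.0;                       // guessed unit clock
    // ~256 GFLOP/s FP64 per unit; a handful of units lands in TFLOPS territory
    printf("%.0f GFLOP/s FP64 peak per SME unit\n", flops * clock_ghz);
    return 0;
}
```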

As for SSVE, think of it as a higher-throughput, higher-latency subset of the existing Arm SVE and SVE2 extensions.
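
Opting into streaming mode is explicit in the programming model. A minimal sketch using the SME ACLE attribute names (compiler support for these is still very fresh, so treat it as illustrative):

```c
#include <arm_sme.h>

// The compiler wraps the body in SMSTART/SMSTOP: inside, vector
// instructions run against the streaming vector length (SVL), and
// only the streaming-legal subset of SVE/SVE2 is available.
__arm_locally_streaming
uint64_t bytes_per_vector(void) {
    return svcntb();  // reports SVL bytes while in streaming mode
}
```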

As @st01014 correctly asks, we don't yet know how wide the SSVE implementation will be: https://mast.hpc.social/@st01014/112411470296867720

As a *complete* guess, assuming the implementation is similar to prior devices, I'd expect the #M4 #SME and #SSVE implementations to be 512 bits wide.

It's simultaneously possible that Apple doubled the width relative to previous incarnations and is shipping a 1024b implementation.

While using Geekbench as any sort of hardcore data source is a fool's errand, certain operations that would be compute-bound under SME are seeing ~2x uplifts on M4.

Combine this with #Arm removing support for non-(2^N)×128-bit implementation sizes, and the spec spanning widths from 128 to 2048 bits, and you realistically have the option of 256b, 512b, or 1024b implementations.
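
Once hardware is in hand, the width question answers itself at runtime: RDSVL reads the streaming vector length without having to enter streaming mode (this sketch needs a toolchain whose assembler accepts SME instructions, e.g. a recent clang with -march=armv9-a+sme):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t svl_bytes;
    // RDSVL Xd, #imm reads the Streaming SVE vector length in bytes
    // (scaled by the immediate) and is legal outside streaming mode.
    __asm__ volatile("rdsvl %0, #1" : "=r"(svl_bytes));
    printf("SSVE vector length: %llu bits\n",
           (unsigned long long)(svl_bytes * 8));
    return 0;
}
```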

What else?

I do wonder if #MTE will be exposed this time around.

Other questions? Do we *also* get real (read: full) #SVE and #SVE2?
If so, how wide per core?

There's also the possibility of #SME2, but that's probably a step too far.

@fclc What are the performance consequences of MTE? I imagine that checking memory tags is not zero cost. Also: aren't Apple pushing hard for more memory-safe languages, where MTE won't make much of a difference?

@dneary AFAIK MTE has been around for a few generations, but not exposed to SW.

Could be a case where the performance/cost tradeoff wasn't where Apple wanted it, and it's been iterated a few times over?

@fclc MTE is an Armv8.5 feature - still not a lot of v8.5+ products on the market - Neoverse N2+ and V2+, and various Apple Ax chips. Has anyone exposed it in software yet?
@dneary @fclc Google Pixel 8 series have it exposed in developer options, and when flashed with GrapheneOS it’s enabled almost universally.
@SolTwoOne @fclc Apologies, I should have been clear: outside of the Android world.
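
(For reference, "exposed" on the Linux/Android side means roughly the sketch below, condensed from the kernel's arm64 MTE documentation; nothing here is an Apple API:)

```c
#include <stdio.h>
#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <asm/hwcap.h>   // HWCAP2_MTE
#include <asm/mman.h>    // PROT_MTE

int main(void) {
    if (!(getauxval(AT_HWCAP2) & HWCAP2_MTE)) {
        puts("MTE not advertised by this kernel");
        return 1;
    }
    // Enable tagged addressing with synchronous tag-check faults,
    // allowing all non-zero tags for random tag generation
    if (prctl(PR_SET_TAGGED_ADDR_CTRL,
              PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
                  (0xfffe << PR_MTE_TAG_SHIFT),
              0, 0, 0)) {
        perror("prctl(PR_SET_TAGGED_ADDR_CTRL)");
        return 1;
    }
    // PROT_MTE opts a mapping into tag checking; loads/stores through
    // mismatched tags in this page now fault synchronously
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    puts("sync MTE enabled on one page");
    return 0;
}
```
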
@fclc Judging by FEAT_* strings in the iPadOS kernel, probably SME2 but no SVE/SVE2. The CPU cores are, according to the ADT, still Sawtooth/Everest (same as M3), so probably not much changes beyond turning AMX into SME(2).