The other variant memcpys C into a local array, accumulates into it, and memcpys it back at the end of the kernel. It is thus 100% equivalent to the std::simd kernel, except that the compiler needs to turn the innermost loop into the SIMD FMA that std::simd encodes directly.
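For reference, a minimal sketch of what the std::simd micro-kernel encodes, using GCC's stdx::simd (std::experimental::simd). All names (`kernel`, `A_col`, `ldb`) and the strip size are illustrative, not the actual benchmark code — it accumulates one SIMD-wide strip of a C row via broadcast × vector FMAs:

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;
using V = stdx::native_simd<float>;

// Accumulate one V::size()-wide strip of a C row:
//   C[j..j+W) += A[i,k] * B[k, j..j+W)  for all k.
// A_col points at A[i,0..K), B has row stride ldb (illustrative names).
void kernel(const float* A_col, const float* B, float* C,
            std::size_t K, std::size_t ldb)
{
    V acc(C, stdx::element_aligned);             // load the C strip once
    for (std::size_t k = 0; k < K; ++k) {
        V b(B + k * ldb, stdx::element_aligned); // load B[k, j..j+W)
        acc += V(A_col[k]) * b;                  // broadcast * vector -> FMA
    }
    acc.copy_to(C, stdx::element_aligned);       // store the strip back
}
```

The point of the `acc` register is exactly the "local array" trick above, minus the memcpys: the accumulator lives in a register for the whole k loop.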

This is ~3–4x slower.

TBH, I expected less of a difference.
But anyway, if you want to express data-parallelism, don't write a loop; use std::simd. It helps.

2/2

#stdsimd

Two more results. This time without using std::simd. One uses a plain loop over C[i, j] += A[i, k] * B[k, j] (in the inner kernel—it is still blocked over all levels of the cache hierarchy).
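That plain loop, roughly (a minimal sketch with illustrative names — the real kernel sits inside the cache-blocking loops, so M, N, K here are block sizes, not the full matrix):

```cpp
#include <cstddef>

// Naive row-major inner kernel: C[i,j] += A[i,k] * B[k,j].
// The compiler must auto-vectorize the j loop to match the
// hand-written std::simd version.
void kernel_scalar(const float* A, const float* B, float* C,
                   std::size_t M, std::size_t N, std::size_t K)
{
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```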

This is ~10–30x slower.

1/2

#stdsimd #cpp26 #simd

mdspan rocks! A simple switch from layout_right to layout_right_padded and performance for larger matrices goes 📈 up! (e.g. 4096×4096 from 76 GFLOP/s to 100 GFLOP/s) I introduced one cache line of padding between rows so that cache-associativity conflicts don't virtually shrink the effective cache sizes.
For small matrices the extra padding is counterproductive, though. But mdspan abstracts it all away: the matrix-mul function is unchanged.

#stdsimd #mdspan #cpp26 #optimization

I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
I got to 86% of peak FLOP rate. x86_64 with AVX2 peaks at 32 single-precision / 16 double-precision FLOP/cycle (2 FMAs per cycle).
I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

#stdsimd #simd #mdspan #cpp26 #cpp

RE: https://fediscience.org/@danielskatz/116453917891423889

🎉 Seems like my #stdsimd work will finally be easier to recognize as research output.

I'm a bit sad today. Yesterday I pushed https://forge.sourceware.org/gcc/gcc-mirror/commit/804bde962de4819138951aed24b2c8ba768d7344, which makes a simple `x + 1` ill-formed: https://compiler-explorer.com/z/4rYx87fcW. Now, in generic code, you write `+ std::cw<1>` instead. If you know the value-type (`float` in this case), just use the appropriate literal (if it exists): `x + 1.f`.

#stdsimd #cpp26

libstdc++: Implement P4012R1 while reverting P3844R2 (consteval simd broadcast) · 804bde962d

P3844R2 added consteval conversion for value-preserving conversion from constants. It had been approved by LEWG in Kona. Therefore, the current implementation has the consteval broadcast constructor. In Croydon, LEWG reversed the decision but changed the overload set to keep the design space ope...

Pro tip for stdx::simd users (the same will be true for std::simd): When you use the generator ctor, don't use a generic lambda unless you *need* it. A generic lambda leads to template bloat — and why pay for something you don't need? If you need a constexpr index, then it's not template bloat, but a necessary instantiation.
I have some homework to do: audit my own code for unnecessary bloat.
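To illustrate (a sketch against GCC's stdx::simd; variable names are mine): the generator ctor invokes the callable with `std::integral_constant<std::size_t, i>` for each lane i. A generic lambda therefore gets one `operator()` instantiation per lane, while a lambda taking a plain `int` lets the constant convert and instantiates exactly once:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;
using V = stdx::native_simd<float>;

// Generic lambda: V::size() distinct operator()<integral_constant<...>>
// instantiations — template bloat if you don't need the constexpr index.
V iota_bloat([](auto i) { return float(i); });

// Plain int parameter: the integral_constant converts to int, so there is
// exactly one instantiation. Same values, less code generated.
V iota_lean([](int i) { return float(i); });
```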
#stdsimd #stdxsimd #Cpp #CPlusPlus

Matthias Kretz @mkretz rocking the C++ world — with data level parallelism!
https://www.youtube.com/watch?v=8xnPsPdy5AQ
#EUGRDays24 #stdsimd #cplusplus
European GNU Radio Days 2024 at GSI/FAIR in Darmstadt, Germany
