Next step: implementing matrix * matrix for large matrices with .NET 7 SIMD. I'm starting to be 30-40% faster than TensorFlow, which is 😎, but I'm still 30-40% slower than MKL! 😅
Not that bad for a first try, but I will have to dig further into whether 1) I can optimize things further with some fancy AVX2 instructions, and 2) I can improve cache locality when going parallel.
I have been trying to tweak my vectorized/multithreaded matrix * matrix multiplication code (tiling/no tiling, trying to maximize cache usage), but no matter what, I'm not able to match MKL's performance... not sure what kind of black magic they are using, but it's super intriguing/frustrating! 🤔
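For anyone following along, here is a rough sketch of the tiling idea in plain C (not the actual .NET code from the thread, and with a hypothetical fixed tile size that would normally be tuned per CPU):

```c
#include <stddef.h>

/* Hypothetical tile size; in practice this is tuned per CPU. */
#define TILE 64

/* Cache-blocked (tiled) matrix multiply: C += A * B, all n x n, row-major.
   Working on TILE x TILE sub-blocks keeps each block resident in cache
   while it is being reused, instead of streaming whole rows and columns
   through the cache on every pass. */
static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t imax = ii + TILE < n ? ii + TILE : n;
                size_t kmax = kk + TILE < n ? kk + TILE : n;
                size_t jmax = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        float a = A[i * n + k]; /* reused across the j loop */
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The inner i/k/j loop order keeps the innermost accesses sequential in memory for both B and C, which is what a vectorizer wants to see.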
Couldn't resist: I'm starting to get matrix multiplication faster than MKL! 😎 I still have to manually tweak some parameters to maximize cache utilization for each matrix size, but now I "just" need to figure out how to calculate these parameters automatically (e.g. from the L1 cache size, etc.)
First time I've had to make an algorithm that takes such things into account; it's pretty interesting and impressive how much they can change the results! 🏎️

@xoofx That's awesome 😀 And you sent me down a rabbit hole as a performance newbie (but a user of MKL and IPP). Without tiling and multi-threading I managed to get 6.5x MKL's perf on a 500x500 matrix. I read about a technique called register blocking https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0#register-blocking which says that with 3x4 blocks we can use all 16 SIMD registers (3 reused in the inner loop, 1 to loop over, and 3*4 accumulators = 16). That's what performs best so far, but when I look at the disassembly it doesn't seem to reuse registers as suggested

https://sharplab.io/#gist:67fcbace16c33703b6a6c5c3b59a58f2

Is it possible to make it reuse the registers like in the gist? I wonder whether you used a similar technique.
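The 3x4 register-blocking layout from the gist, sketched in plain C with scalars standing in for the 8-wide SIMD vectors (an illustration of the accumulator pattern, not the sharplab C# code): twelve accumulators declared as separate locals, so the compiler is free to pin each one in its own register across the k loop.

```c
#include <stddef.h>

/* 3x4 register-blocked micro-kernel (scalars standing in for SIMD lanes):
   computes a 3-row by 4-column block of C += A * B for row-major n x n
   matrices. The 12 accumulators are individual locals, so the compiler
   can keep each one in its own register for the whole k loop. */
static void kernel_3x4(const float *A, const float *B, float *C,
                       size_t n, size_t i, size_t j)
{
    float c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    float c10 = 0, c11 = 0, c12 = 0, c13 = 0;
    float c20 = 0, c21 = 0, c22 = 0, c23 = 0;

    for (size_t k = 0; k < n; k++) {
        /* 3 reused values from A... */
        float a0 = A[(i + 0) * n + k];
        float a1 = A[(i + 1) * n + k];
        float a2 = A[(i + 2) * n + k];
        /* ...times 4 values from B, feeding the 3*4 accumulators. */
        float b0 = B[k * n + j + 0];
        float b1 = B[k * n + j + 1];
        float b2 = B[k * n + j + 2];
        float b3 = B[k * n + j + 3];
        c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
        c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
        c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
    }
    C[(i + 0) * n + j + 0] += c00; C[(i + 0) * n + j + 1] += c01;
    C[(i + 0) * n + j + 2] += c02; C[(i + 0) * n + j + 3] += c03;
    C[(i + 1) * n + j + 0] += c10; C[(i + 1) * n + j + 1] += c11;
    C[(i + 1) * n + j + 2] += c12; C[(i + 1) * n + j + 3] += c13;
    C[(i + 2) * n + j + 0] += c20; C[(i + 2) * n + j + 1] += c21;
    C[(i + 2) * n + j + 2] += c22; C[(i + 2) * n + j + 3] += c23;
}
```

In the SIMD version each `b` would be a vector load, each `a` a broadcast, and each `c` a vector accumulator, for 4 + 3 + 12 registers in flight.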

@jldr I have only experimented with floats, not doubles, but the optimizations should apply similarly. 🙂
Though, getting 6x over MKL seems really suspicious 😉, because at the asm level the code cannot really be optimized much beyond MKL, so I would double-check (e.g. the correctness of the results, which version of MKL is used, how it is configured, etc.)
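A cheap sanity check along those lines (a hypothetical helper, not something from the thread): compare the optimized result against a naive triple loop element by element, with a relative tolerance to absorb the different summation orders that SIMD and tiling introduce.

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if every element of 'opt' matches 'ref' within a relative
   tolerance, 0 otherwise. A relative tolerance matters here because
   reordered float summation accumulates slightly different rounding. */
static int matrices_match(const float *ref, const float *opt,
                          size_t count, float rel_tol)
{
    for (size_t i = 0; i < count; i++) {
        float diff = fabsf(ref[i] - opt[i]);
        float scale = fabsf(ref[i]) > 1.0f ? fabsf(ref[i]) : 1.0f;
        if (diff > rel_tol * scale)
            return 0;
    }
    return 1;
}
```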
@jldr Secondly, from your sharplab code: you should not use an intermediate struct, because you can see the code is not using registers but the stack (rsp), so performance will be severely impacted here. Unlike a C++ compiler, which can keep things in registers as if the struct didn't exist, you will need to use local variables instead. To check that all xmm registers are used, you should see the registers xmm6 to xmm14 saved at the beginning of your function.

@xoofx I tried without the struct, with 12 individual Vector256 variables, but then gave up since I got the same disasm back 😕. But it sounds like I need to experiment more after your answer; maybe I just placed them badly.

Oh, and I'm so sorry for the misunderstanding. I have 6 times worse perf than MKL, of course 😃 (~3x now with 2 threads). I actually wanted to sound humble rather than bragging.

Thanks very much for your answer