Next step: I'm implementing matrix * matrix for large matrices with .NET 7 SIMD. It's already 30-40% faster than TensorFlow, which is 😎, but I'm still 30-40% slower than MKL! 😅
Not that bad for a first try, but I will have to dig further to see 1) if I can optimize things further with some fancy AVX2 instructions, 2) if I can improve cache locality when going parallel
I have been trying to tweak my vectorized/multithreaded matrix * matrix multiplication code (tiling/no tiling, trying to squeeze more out of the cache), but no matter what, I'm not able to match MKL's performance... not sure what kind of black magic they are using, but it's super intriguing/frustrating! 🤔
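For context, the tiling idea mentioned above looks roughly like this — a minimal C sketch (not the actual .NET code from the thread), with an assumed matrix size `N` and tile size `BLOCK` chosen so that the tiles of A, B and C stay cache-resident:

```c
#include <stdlib.h>
#include <string.h>

#define N 64     /* matrix dimension; assumed divisible by BLOCK */
#define BLOCK 16 /* tile edge, picked so the working set fits in L1 */

/* Naive triple loop: C = A * B, row-major. Strides across B column-wise,
   so for large N it thrashes the cache. */
static void matmul_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}

/* Tiled version: iterate over BLOCK x BLOCK tiles so each inner loop
   nest reuses data while it is still in cache. */
static void matmul_tiled(const double *A, const double *B, double *C) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        double a = A[i * N + k]; /* reused across the j loop */
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

The inner `j` loop is contiguous over B and C, which is also what makes it straightforward to vectorize (e.g. with `Vector256<double>` in .NET or AVX2 intrinsics in C).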
@xoofx there are so many tricks in big matrix multiplication to avoid recomputing numbers - could it be that? I’ve read a few papers over the years and it blew my mind heh!

@neilhenning Yeah, but I haven't seen anybody use them in practical BLAS implementations (and likely not MKL either), as they come with lots of constraints that don't play well with SIMD, e.g. register pressure.
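For readers wondering what "tricks to avoid recomputing numbers" means here: the classic example is Strassen's algorithm, which multiplies 2x2 block matrices with 7 sub-products instead of 8. A hedged one-level C sketch (sub-products fall back to a naive kernel; `n` assumed even, matrices row-major and contiguous — this is illustrative, not what any BLAS actually ships):

```c
#include <stdlib.h>
#include <string.h>

/* Naive m x m multiply on contiguous row-major buffers. */
static void mm_naive(const double *A, const double *B, double *C, int m) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) {
            double s = 0.0;
            for (int k = 0; k < m; k++)
                s += A[i * m + k] * B[k * m + j];
            C[i * m + j] = s;
        }
}

static void mm_add(const double *X, const double *Y, double *Z, int m) {
    for (int i = 0; i < m * m; i++) Z[i] = X[i] + Y[i];
}
static void mm_sub(const double *X, const double *Y, double *Z, int m) {
    for (int i = 0; i < m * m; i++) Z[i] = X[i] - Y[i];
}

/* Copy quadrant (r, c) of an n x n matrix to/from a contiguous m x m buffer. */
static void quad_get(const double *A, double *Q, int n, int r, int c) {
    int m = n / 2;
    for (int i = 0; i < m; i++)
        memcpy(&Q[i * m], &A[(i + r * m) * n + c * m], m * sizeof(double));
}
static void quad_put(double *A, const double *Q, int n, int r, int c) {
    int m = n / 2;
    for (int i = 0; i < m; i++)
        memcpy(&A[(i + r * m) * n + c * m], &Q[i * m], m * sizeof(double));
}

/* One level of Strassen: 7 sub-products M1..M7 instead of 8. */
static void strassen_once(const double *A, const double *B, double *C, int n) {
    int m = n / 2, sz = m * m;
    double *buf = malloc(sizeof(double) * sz * 18);
    double *A11 = buf,          *A12 = buf + sz,
           *A21 = buf + 2 * sz, *A22 = buf + 3 * sz;
    double *B11 = buf + 4 * sz, *B12 = buf + 5 * sz,
           *B21 = buf + 6 * sz, *B22 = buf + 7 * sz;
    double *M  = buf + 8 * sz;  /* M1..M7 at M + 0*sz .. M + 6*sz */
    double *T1 = buf + 15 * sz, *T2 = buf + 16 * sz, *R = buf + 17 * sz;
    quad_get(A, A11, n, 0, 0); quad_get(A, A12, n, 0, 1);
    quad_get(A, A21, n, 1, 0); quad_get(A, A22, n, 1, 1);
    quad_get(B, B11, n, 0, 0); quad_get(B, B12, n, 0, 1);
    quad_get(B, B21, n, 1, 0); quad_get(B, B22, n, 1, 1);
    mm_add(A11, A22, T1, m); mm_add(B11, B22, T2, m);
    mm_naive(T1, T2, M + 0 * sz, m);                     /* M1 = (A11+A22)(B11+B22) */
    mm_add(A21, A22, T1, m); mm_naive(T1, B11, M + 1 * sz, m); /* M2 = (A21+A22)B11 */
    mm_sub(B12, B22, T2, m); mm_naive(A11, T2, M + 2 * sz, m); /* M3 = A11(B12-B22) */
    mm_sub(B21, B11, T2, m); mm_naive(A22, T2, M + 3 * sz, m); /* M4 = A22(B21-B11) */
    mm_add(A11, A12, T1, m); mm_naive(T1, B22, M + 4 * sz, m); /* M5 = (A11+A12)B22 */
    mm_sub(A21, A11, T1, m); mm_add(B11, B12, T2, m);
    mm_naive(T1, T2, M + 5 * sz, m);                     /* M6 = (A21-A11)(B11+B12) */
    mm_sub(A12, A22, T1, m); mm_add(B21, B22, T2, m);
    mm_naive(T1, T2, M + 6 * sz, m);                     /* M7 = (A12-A22)(B21+B22) */
    /* C11 = M1 + M4 - M5 + M7 */
    mm_add(M + 0 * sz, M + 3 * sz, T1, m); mm_sub(T1, M + 4 * sz, T2, m);
    mm_add(T2, M + 6 * sz, R, m); quad_put(C, R, n, 0, 0);
    /* C12 = M3 + M5 */
    mm_add(M + 2 * sz, M + 4 * sz, R, m); quad_put(C, R, n, 0, 1);
    /* C21 = M2 + M4 */
    mm_add(M + 1 * sz, M + 3 * sz, R, m); quad_put(C, R, n, 1, 0);
    /* C22 = M1 - M2 + M3 + M6 */
    mm_sub(M + 0 * sz, M + 1 * sz, T1, m); mm_add(T1, M + 2 * sz, T2, m);
    mm_add(T2, M + 5 * sz, R, m); quad_put(C, R, n, 1, 1);
    free(buf);
}
```

The extra additions and temporaries are exactly the kind of constraint mentioned above: they increase memory traffic and register pressure, which is why the constant factor only pays off for very large matrices.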

I really don't know what they are doing. It's also quite sad that it's a closed-source library, since we can't replicate it or learn from it.

@xoofx If you have a sampling profiler, it'll probably point you to the main loop in the binary that you're interested in pretty quickly. There's at least a chance it's written in assembly, so you might not be missing much? Not sure, but that's how I'd try to learn from it.

(And IANAL, but I suspect "replicating" is a grey area, regardless of whether or not it's open source.)

@dougall @xoofx It might also be worth looking at CPU perf counters, if that's possible with .NET processes, and comparing them to MKL's figures. At a minimum, it'll point you towards areas where significant improvements are still possible.