Next step: I'm implementing matrix * matrix for large matrices with .NET 7 SIMD. It's already 30-40% faster than TensorFlow, which is 😎, but I'm still 30-40% slower than MKL! 😅
Not that bad for a first try, but I'll have to dig further into 1) whether I can optimize things further with some fancy AVX2 instructions, and 2) whether I can improve cache locality when going parallel.
I have been trying to tweak my vectorized/multithreaded matrix * matrix multiplication code (tiling/no tiling, trying to maximize cache usage), but no matter what, I'm not able to match the performance of MKL... not sure what kind of black magic they are using, but it's super intriguing/frustrating! 🤔
Couldn't resist: my matrix multiplication is starting to get faster than MKL! 😎 I still have to manually tweak some parameters to maximize cache utilization for each matrix size, but now I "just" need to figure out how to calculate these parameters automatically (e.g. from the L1 cache size, etc.)
First time I've had to write an algorithm that takes such things into account; it's pretty interesting and impressive how much they can change the results! 🏎️
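For what it's worth, deriving such a parameter automatically can be sketched like this in C (a hypothetical illustration: the function name and the three-tile budget are my assumptions, not xoofx's actual heuristic):

```c
#include <stddef.h>

/* Hypothetical sketch: pick the largest square tile size such that three
   float tiles (a block each of A, B and C) fit in the L1 data cache.
   The 3-tile budget is an assumption for illustration only. */
static int tile_size_for_l1(int l1_bytes)
{
    int elems = l1_bytes / (3 * (int)sizeof(float)); /* floats per tile */
    int t = 1;
    while ((t + 1) * (t + 1) <= elems)               /* largest t with t*t <= elems */
        t++;
    return t;
}
```

For a typical 32 KiB L1 data cache this yields a 52x52 float tile; a real implementation would likely round down to a SIMD-friendly multiple.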
@xoofx that's damn impressive! 🚀

@xoofx That's awesome 😀 And you sent me down a rabbit hole as a performance newbie (but user of MKL and IPP). Without tiling and multi-threading I managed to get 6.5x MKL's perf on a 500x500 matrix. I read about a technique called register blocking https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0#register-blocking which says that by doing 3x4 blocks we can use all 16 SIMD registers (3 reused in the inner loop, 1 to loop over, and 3*4 accumulators = 16). That's what performs best so far, but when I look at the disassembly it doesn't seem to reuse registers as suggested

https://sharplab.io/#gist:67fcbace16c33703b6a6c5c3b59a58f2

Is it possible to make it re-use the registers like in the gist? Wonder whether you used a similar technique
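The 3x4 register-blocking idea from the gist can be sketched in C with plain scalar floats standing in for SIMD registers; the point is that all 12 accumulators are plain locals the compiler can keep in registers across the whole K loop. (This is my illustrative sketch, not the sharplab code; dimensions are assumed to be multiples of the block sizes for brevity.)

```c
#include <stddef.h>

/* Illustrative 3x4 register-blocked matmul (C = A*B, row-major), with scalar
   floats standing in for SIMD registers. The 12 accumulators are plain locals
   so the compiler can keep them in registers for the whole K loop.
   Assumes n is a multiple of 3 and m a multiple of 4. */
static void matmul_3x4_blocked(const float *a, const float *b, float *c,
                               size_t n, size_t k, size_t m)
{
    for (size_t i = 0; i < n; i += 3) {
        for (size_t j = 0; j < m; j += 4) {
            float c00 = 0, c01 = 0, c02 = 0, c03 = 0;
            float c10 = 0, c11 = 0, c12 = 0, c13 = 0;
            float c20 = 0, c21 = 0, c22 = 0, c23 = 0;
            for (size_t p = 0; p < k; p++) {
                /* with SIMD rows of B, the gist's register count is
                   3 (A broadcasts) + 1 (B load) + 12 (accumulators) = 16 */
                float a0 = a[(i + 0) * k + p];
                float a1 = a[(i + 1) * k + p];
                float a2 = a[(i + 2) * k + p];
                float b0 = b[p * m + j + 0], b1 = b[p * m + j + 1];
                float b2 = b[p * m + j + 2], b3 = b[p * m + j + 3];
                c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
                c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
                c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
            }
            c[(i + 0) * m + j + 0] = c00; c[(i + 0) * m + j + 1] = c01;
            c[(i + 0) * m + j + 2] = c02; c[(i + 0) * m + j + 3] = c03;
            c[(i + 1) * m + j + 0] = c10; c[(i + 1) * m + j + 1] = c11;
            c[(i + 1) * m + j + 2] = c12; c[(i + 1) * m + j + 3] = c13;
            c[(i + 2) * m + j + 0] = c20; c[(i + 2) * m + j + 1] = c21;
            c[(i + 2) * m + j + 2] = c22; c[(i + 2) * m + j + 3] = c23;
        }
    }
}
```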

@jldr I have only experimented with floats, not doubles, but the optimizations should apply similarly. 🙂
Though, getting 6x over MKL seems really suspicious 😉, since at the asm level the code can't really be optimized much beyond what MKL does, so I would double-check (e.g. the correctness of the results, which version of MKL is used, how it is configured, etc.)
@jldr Secondly, from your sharplab code: you should not use an intermediate struct, because you can see that the code is using the stack (rsp) rather than registers, so performance will be severely impacted here. Unlike a C++ compiler, which can keep things in registers as if the struct didn't exist, you need to use local variables instead. To check that all the xmm registers are used, you should see registers xmm6 to xmm14 saved at the beginning of your function

@xoofx I tried without the struct, with 12 individual Vector256 variables, but then gave up since I got the same disasm back 😕 but it sounds like I need to experiment more after your answer; maybe I just placed them badly

Oh, and I'm so sorry for the misunderstanding. I have 6 times worse perf than MKL, of course 😃 (~3x now with 2 threads). I actually wanted to sound humble instead of bragging.

Thanks very much for your answer

@xoofx Wow, that did the trick: placing the vars at the start makes it reuse the registers. That was a great pointer.
@xoofx for small matrices MathV seems significantly faster though
@xoofx there are so many tricks in big matrix multiplication to avoid recomputing numbers - could it be that? I've read a few papers over the years and it blew my mind heh!

@neilhenning Yeah, but I have not seen anybody using them in practical BLAS implementations (and it's unlikely MKL does), as they come with lots of constraints that don't suit e.g. SIMD register pressure.

I really don't know what they are doing. It's also quite sad that it's a closed-source library, as we can't replicate/learn from it.

@xoofx If you have a sampling profiler, it'll probably point you to the main loop in the binary that you're interested in pretty quickly. There's at least a chance it's written in assembly, so you might not be missing much? Not sure, but that's how I'd try to learn from it.

(And IANAL, but I suspect "replicating" is a grey area, regardless of whether or not it's open source.)

@dougall @xoofx It might also be worth looking at CPU perf counters, if that's possible with .NET processes, and comparing to MKL's figures. At minimum it'll point you towards areas where you can still make significant improvements.
@dougall yeah, I could, but I won't (because of the grey area 😅). It's ok, I think I'm gonna stop optimizing stuff here; my implementation is not as fast as MKL (for large matrices), but it is still faster than many other implementations out there (e.g. OpenBLAS)
@xoofx weird that TF is that slow. Aren't they (still) using a highly optimized lib underneath?

@nietras I think they are still using an optimized version (SIMD + MT), otherwise they wouldn't even get close, but... for some reason, my version is doing a good job! 😅

Difficult to beat MKL though. I'm gonna keep my current version and move on. Don't want to spend too much time on something that has been optimized for years by others...

@xoofx definitely good job. You might have looked at ML.NET code too, but not sure how optimized it really is.

https://github.com/dotnet/machinelearning/blob/04dda55ab0902982b16309c8e151f13a53e9366d/src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs


@xoofx maybe compare with BLAS too?

Where does accuracy and precision fall vs speed? (Compare: a function that just immediately returns is a very fast matrix multiplication but neither accurate nor precise.)

@christer The accuracy is similar between the different versions (MKL, TensorFlow and mine), as they are all using multithreading and FMA instructions. I'd assume MKL is about the fastest on the market, unless e.g. OpenBLAS is faster?
@xoofx iirc there is a multiplication algorithm faster than O(n^3) (Strassen-style recursive blocking). Also loop tiling (locality) should help a lot.
@zahirtezcan yeah, I'm already doing that, otherwise I wouldn't be able to achieve this performance, but I'm not fully leveraging cache-line locality behavior, so that's something I need to fix, and it should hopefully close the gap with MKL
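For readers following along, the loop-tiling idea reads roughly like this in C (a minimal sketch of the general technique, not xoofx's actual implementation; the kk/jj blocking keeps a tile of B hot in cache while it is reused across rows of A):

```c
#include <stddef.h>
#include <string.h>

#define TILE 64 /* would be derived from the cache size in a real implementation */

/* Minimal loop-tiling sketch for C = A*B with n x n row-major matrices.
   Blocking over k and j means the TILE x TILE block of B stays resident in
   cache while it is reused for every row i, instead of being evicted. */
static void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    memset(c, 0, n * n * sizeof(float));
    for (size_t kk = 0; kk < n; kk += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = 0; i < n; i++)
                for (size_t k = kk; k < kk + TILE && k < n; k++) {
                    float aik = a[i * n + k];
                    for (size_t j = jj; j < jj + TILE && j < n; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
}
```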