Next step: I'm implementing matrix * matrix for large matrices with .NET 7 SIMD. It's already 30-40% faster than TensorFlow, which is 😎, but I'm still 30-40% slower than MKL! 😅
Not that bad for a first try, but I'll have to dig further into 1) whether I can optimize things further with some fancy AVX2 instructions, and 2) whether I can improve cache locality when going parallel.
I have been trying to tweak my vectorized/multithreaded matrix * matrix multiplication code (tiling/no tiling, trying to maximize cache usage), but no matter what, I'm not able to match the performance of MKL... not sure what kind of black magic they are using, but it's super intriguing/frustrating! 🤔
Couldn't resist: my matrix multiplication is starting to get faster than MKL! 😎 I still have to manually tweak some parameters to maximize cache utilization for each matrix size, but now I "just" need to figure out how to calculate these parameters automatically (e.g. from the L1 cache size, etc.)
First time I've had to write an algorithm that takes such things into account; it's pretty interesting and impressive how much they can change the results! 🏎️
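For what it's worth, deriving such a parameter automatically can be sketched like this in C (a hypothetical illustration: the function name and the three-tile budget are my assumptions, not xoofx's actual heuristic):

```c
#include <stddef.h>

/* Hypothetical sketch: pick the largest square tile size such that three
   float tiles (a block each of A, B and C) fit in the L1 data cache.
   The 3-tile budget is an assumption for illustration only. */
static int tile_size_for_l1(int l1_bytes)
{
    int elems = l1_bytes / (3 * (int)sizeof(float)); /* floats per tile */
    int t = 1;
    while ((t + 1) * (t + 1) <= elems)               /* largest t with t*t <= elems */
        t++;
    return t;
}
```

For a typical 32 KiB L1 data cache this yields a 52x52 float tile; a real implementation would likely round down to a SIMD-friendly multiple.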
@xoofx that's damn impressive! 🚀

@xoofx That's awesome 😀 And you sent me down a rabbit hole as a performance newbie (but user of MKL and IPP). Without tiling and multi-threading I managed to get 6.5x MKL's perf on a 500x500 matrix. I read about a technique called register blocking https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0#register-blocking which says that by doing 3x4 blocks we can use all 16 SIMD registers (3 reused in the inner loop, 1 to loop over, and 3*4 accumulators = 16). That's what performs best so far, but when I look at the disassembly it doesn't seem to reuse registers as suggested

https://sharplab.io/#gist:67fcbace16c33703b6a6c5c3b59a58f2

Is it possible to make it re-use the registers like in the gist? Wonder whether you used a similar technique
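The 3x4 register-blocking idea from the gist can be sketched in C with plain scalar floats standing in for SIMD registers; the point is that all 12 accumulators are plain locals the compiler can keep in registers across the whole K loop. (This is my illustrative sketch, not the sharplab code; dimensions are assumed to be multiples of the block sizes for brevity.)

```c
#include <stddef.h>

/* Illustrative 3x4 register-blocked matmul (C = A*B, row-major), with scalar
   floats standing in for SIMD registers. The 12 accumulators are plain locals
   so the compiler can keep them in registers for the whole K loop.
   Assumes n is a multiple of 3 and m a multiple of 4. */
static void matmul_3x4_blocked(const float *a, const float *b, float *c,
                               size_t n, size_t k, size_t m)
{
    for (size_t i = 0; i < n; i += 3) {
        for (size_t j = 0; j < m; j += 4) {
            float c00 = 0, c01 = 0, c02 = 0, c03 = 0;
            float c10 = 0, c11 = 0, c12 = 0, c13 = 0;
            float c20 = 0, c21 = 0, c22 = 0, c23 = 0;
            for (size_t p = 0; p < k; p++) {
                /* with SIMD rows of B, the gist's register count is
                   3 (A broadcasts) + 1 (B load) + 12 (accumulators) = 16 */
                float a0 = a[(i + 0) * k + p];
                float a1 = a[(i + 1) * k + p];
                float a2 = a[(i + 2) * k + p];
                float b0 = b[p * m + j + 0], b1 = b[p * m + j + 1];
                float b2 = b[p * m + j + 2], b3 = b[p * m + j + 3];
                c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
                c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
                c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
            }
            c[(i + 0) * m + j + 0] = c00; c[(i + 0) * m + j + 1] = c01;
            c[(i + 0) * m + j + 2] = c02; c[(i + 0) * m + j + 3] = c03;
            c[(i + 1) * m + j + 0] = c10; c[(i + 1) * m + j + 1] = c11;
            c[(i + 1) * m + j + 2] = c12; c[(i + 1) * m + j + 3] = c13;
            c[(i + 2) * m + j + 0] = c20; c[(i + 2) * m + j + 1] = c21;
            c[(i + 2) * m + j + 2] = c22; c[(i + 2) * m + j + 3] = c23;
        }
    }
}
```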

@jldr I have only experimented with floats, not doubles, but the optimizations should apply similarly. 🙂
Though, getting 6x over MKL seems really suspicious 😉, since at the asm level the code can't really be optimized much beyond what MKL does, so I would double-check (e.g. the correctness of the results, which version of MKL is used, how it is configured, etc.)
@jldr Secondly, from your sharplab code: you should not use an intermediate struct, because you can see that the code is using the stack (rsp) rather than registers, so performance will be severely impacted here. Unlike a C++ compiler, which can keep things in registers as if the struct didn't exist, you need to use local variables instead. To check that all the xmm registers are used, you should see registers xmm6 to xmm14 saved at the beginning of your function

@xoofx I tried without the struct, with 12 individual Vector256 variables, but then gave up since I got the same disasm back 😕 but it sounds like I need to experiment more after your answer; maybe I just placed them badly

Oh, and I'm so sorry for the misunderstanding. I have 6 times worse perf than MKL, of course 😃 (~3x now with 2 threads). I actually wanted to sound humble instead of bragging.

Thanks very much for your answer

@xoofx Wow, that did the trick: placing the vars at the start makes it reuse the registers. That was a great pointer.
@xoofx for small matrices MathV seems significantly faster though
@xoofx there are so many tricks in big matrix multiplication to avoid recomputing numbers - could it be that? I've read a few papers over the years and it blew my mind heh!

@neilhenning Yeah, but I have not seen anybody using them in practical BLAS implementations (and it's unlikely MKL does), as they come with lots of constraints that don't suit e.g. SIMD register pressure.

I really don't know what they are doing. It's also quite sad that it's a closed-source library, as we can't replicate/learn from it.

@xoofx If you have a sampling profiler, it'll probably point you to the main loop in the binary that you're interested in pretty quickly. There's at least a chance it's written in assembly, so you might not be missing much? Not sure, but that's how I'd try to learn from it.

(And IANAL, but I suspect "replicating" is a grey area, regardless of whether or not it's open source.)

@dougall @xoofx It might also be worth looking at CPU perf counters, if that's possible with .NET processes, and comparing to MKL's figures. At minimum it'll point you towards areas where you can still make significant improvements.
@dougall yeah, I could, but I won't (because of the grey area 😅). It's ok, I think I'm gonna stop optimizing stuff here; my implementation is not as fast as MKL (for large matrices), but it is still faster than many other implementations out there (e.g. OpenBLAS)
@xoofx weird that TF is that slow. Aren't they (still) using a highly optimized lib underneath?

@nietras I think they are still using an optimized version (SIMD + MT), otherwise they wouldn't even get close, but... for some reason, my version is doing a good job! 😅

Difficult to beat MKL though. I'm gonna keep my current version and move on. Don't want to spend too much time on something that has been optimized for years by others...

@xoofx definitely good job. You might have looked at ML.NET code too, but not sure how optimized it really is.

https://github.com/dotnet/machinelearning/blob/04dda55ab0902982b16309c8e151f13a53e9366d/src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs


@xoofx maybe compare with BLAS too?

Where does accuracy and precision fall vs speed? (Compare: a function that just immediately returns is a very fast matrix multiplication but neither accurate nor precise.)

@christer The accuracy is similar between the different versions (MKL, TensorFlow and mine), as they are all using multithreading and FMA instructions. I'd assume MKL is about the fastest on the market, unless e.g. OpenBLAS is faster?
@xoofx iirc there is a multiplication algorithm faster than O(n^3) (Strassen-style recursive blocking). Also loop tiling (locality) should help a lot.
@zahirtezcan yeah, I'm already doing that, otherwise I wouldn't be able to achieve this performance, but I'm not fully leveraging cache-line locality behavior, so that's something I need to fix, and it should hopefully close the gap with MKL
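For readers following along, the loop-tiling idea reads roughly like this in C (a minimal sketch of the general technique, not xoofx's actual implementation; the kk/jj blocking keeps a tile of B hot in cache while it is reused across rows of A):

```c
#include <stddef.h>
#include <string.h>

#define TILE 64 /* would be derived from the cache size in a real implementation */

/* Minimal loop-tiling sketch for C = A*B with n x n row-major matrices.
   Blocking over k and j means the TILE x TILE block of B stays resident in
   cache while it is reused for every row i, instead of being evicted. */
static void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    memset(c, 0, n * n * sizeof(float));
    for (size_t kk = 0; kk < n; kk += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = 0; i < n; i++)
                for (size_t k = kk; k < kk + TILE && k < n; k++) {
                    float aik = a[i * n + k];
                    for (size_t j = jj; j < jj + TILE && j < n; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
}
```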