Next step: I'm implementing matrix * matrix for large matrices with .NET 7 SIMD. It's already 30-40% faster than TensorFlow, which is 😎, but I'm still 30-40% slower than MKL! 😅
Not that bad for a first try, but I will have to dig further to see 1) if I can optimize things further with some fancy AVX2 instructions, 2) if I can improve cache locality when going parallel
I have been trying to tweak my vectorized/multithreaded matrix * matrix multiplication code (tiling/no tiling, trying to squeeze more out of the cache), but no matter what, I'm not able to match MKL's performance... not sure what kind of black magic they are using, but it's super intriguing/frustrating! 🤔
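For context, the tiling idea mentioned above looks roughly like this — a minimal C sketch (not the actual .NET code from the thread), with an assumed matrix size `N` and tile size `BLOCK` chosen so that the tiles of A, B and C stay cache-resident:

```c
#include <stdlib.h>
#include <string.h>

#define N 64     /* matrix dimension; assumed divisible by BLOCK */
#define BLOCK 16 /* tile edge, picked so the working set fits in L1 */

/* Naive triple loop: C = A * B, row-major. Strides across B column-wise,
   so for large N it thrashes the cache. */
static void matmul_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}

/* Tiled version: iterate over BLOCK x BLOCK tiles so each inner loop
   nest reuses data while it is still in cache. */
static void matmul_tiled(const double *A, const double *B, double *C) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        double a = A[i * N + k]; /* reused across the j loop */
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

The inner `j` loop is contiguous over B and C, which is also what makes it straightforward to vectorize (e.g. with `Vector256<double>` in .NET or AVX2 intrinsics in C).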
@xoofx there are so many tricks in big matrix multiplication to avoid recomputing numbers - could it be that? I’ve read a few papers over the years and it blew my mind heh!

@neilhenning Yeah, but I haven't seen anybody use them in practical BLAS implementations (and likely not MKL either), as they come with lots of constraints that don't play well with SIMD, e.g. register pressure.
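For readers wondering what "tricks to avoid recomputing numbers" means here: the classic example is Strassen's algorithm, which multiplies 2x2 block matrices with 7 sub-products instead of 8. A hedged one-level C sketch (sub-products fall back to a naive kernel; `n` assumed even, matrices row-major and contiguous — this is illustrative, not what any BLAS actually ships):

```c
#include <stdlib.h>
#include <string.h>

/* Naive m x m multiply on contiguous row-major buffers. */
static void mm_naive(const double *A, const double *B, double *C, int m) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) {
            double s = 0.0;
            for (int k = 0; k < m; k++)
                s += A[i * m + k] * B[k * m + j];
            C[i * m + j] = s;
        }
}

static void mm_add(const double *X, const double *Y, double *Z, int m) {
    for (int i = 0; i < m * m; i++) Z[i] = X[i] + Y[i];
}
static void mm_sub(const double *X, const double *Y, double *Z, int m) {
    for (int i = 0; i < m * m; i++) Z[i] = X[i] - Y[i];
}

/* Copy quadrant (r, c) of an n x n matrix to/from a contiguous m x m buffer. */
static void quad_get(const double *A, double *Q, int n, int r, int c) {
    int m = n / 2;
    for (int i = 0; i < m; i++)
        memcpy(&Q[i * m], &A[(i + r * m) * n + c * m], m * sizeof(double));
}
static void quad_put(double *A, const double *Q, int n, int r, int c) {
    int m = n / 2;
    for (int i = 0; i < m; i++)
        memcpy(&A[(i + r * m) * n + c * m], &Q[i * m], m * sizeof(double));
}

/* One level of Strassen: 7 sub-products M1..M7 instead of 8. */
static void strassen_once(const double *A, const double *B, double *C, int n) {
    int m = n / 2, sz = m * m;
    double *buf = malloc(sizeof(double) * sz * 18);
    double *A11 = buf,          *A12 = buf + sz,
           *A21 = buf + 2 * sz, *A22 = buf + 3 * sz;
    double *B11 = buf + 4 * sz, *B12 = buf + 5 * sz,
           *B21 = buf + 6 * sz, *B22 = buf + 7 * sz;
    double *M  = buf + 8 * sz;  /* M1..M7 at M + 0*sz .. M + 6*sz */
    double *T1 = buf + 15 * sz, *T2 = buf + 16 * sz, *R = buf + 17 * sz;
    quad_get(A, A11, n, 0, 0); quad_get(A, A12, n, 0, 1);
    quad_get(A, A21, n, 1, 0); quad_get(A, A22, n, 1, 1);
    quad_get(B, B11, n, 0, 0); quad_get(B, B12, n, 0, 1);
    quad_get(B, B21, n, 1, 0); quad_get(B, B22, n, 1, 1);
    mm_add(A11, A22, T1, m); mm_add(B11, B22, T2, m);
    mm_naive(T1, T2, M + 0 * sz, m);                     /* M1 = (A11+A22)(B11+B22) */
    mm_add(A21, A22, T1, m); mm_naive(T1, B11, M + 1 * sz, m); /* M2 = (A21+A22)B11 */
    mm_sub(B12, B22, T2, m); mm_naive(A11, T2, M + 2 * sz, m); /* M3 = A11(B12-B22) */
    mm_sub(B21, B11, T2, m); mm_naive(A22, T2, M + 3 * sz, m); /* M4 = A22(B21-B11) */
    mm_add(A11, A12, T1, m); mm_naive(T1, B22, M + 4 * sz, m); /* M5 = (A11+A12)B22 */
    mm_sub(A21, A11, T1, m); mm_add(B11, B12, T2, m);
    mm_naive(T1, T2, M + 5 * sz, m);                     /* M6 = (A21-A11)(B11+B12) */
    mm_sub(A12, A22, T1, m); mm_add(B21, B22, T2, m);
    mm_naive(T1, T2, M + 6 * sz, m);                     /* M7 = (A12-A22)(B21+B22) */
    /* C11 = M1 + M4 - M5 + M7 */
    mm_add(M + 0 * sz, M + 3 * sz, T1, m); mm_sub(T1, M + 4 * sz, T2, m);
    mm_add(T2, M + 6 * sz, R, m); quad_put(C, R, n, 0, 0);
    /* C12 = M3 + M5 */
    mm_add(M + 2 * sz, M + 4 * sz, R, m); quad_put(C, R, n, 0, 1);
    /* C21 = M2 + M4 */
    mm_add(M + 1 * sz, M + 3 * sz, R, m); quad_put(C, R, n, 1, 0);
    /* C22 = M1 - M2 + M3 + M6 */
    mm_sub(M + 0 * sz, M + 1 * sz, T1, m); mm_add(T1, M + 2 * sz, T2, m);
    mm_add(T2, M + 5 * sz, R, m); quad_put(C, R, n, 1, 1);
    free(buf);
}
```

The extra additions and temporaries are exactly the kind of constraint mentioned above: they increase memory traffic and register pressure, which is why the constant factor only pays off for very large matrices.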

I really don't know what they are doing. It's also quite sad that it's a closed-source library, since we can't replicate it or learn from it.

@xoofx If you have a sampling profiler, it'll probably point you to the main loop in the binary that you're interested in pretty quickly. There's at least a chance it's written in assembly, so you might not be missing much? Not sure, but that's how I'd try to learn from it.

(And IANAL, but I suspect "replicating" is a grey area, regardless of whether or not it's open source.)

@dougall @xoofx It might also be worth looking at CPU perf counters, if that's possible with .NET processes, and comparing them to MKL's figures. At a minimum, it'll point you towards areas where significant improvements are still possible.