Not bad for a first try, but I will have to dig further to see whether 1) I can optimize things further with some fancy AVX2 instructions, and 2) I can improve cache locality when going //
@xoofx That's awesome! And you sent me down a rabbit hole as a performance newbie (but a user of MKL and IPP). Without tiling and multi-threading I managed to get 6.5x MKL's perf on a 500x500 matrix. I read about a technique called register blocking https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0#register-blocking and it said that by using 3x4 blocks we can use all 16 SIMD registers (3 reused in the inner loop, 1 to loop over, and 3*4 accumulators = 16). That's what performs best so far, but when I look at the disassembly it doesn't seem to reuse registers as suggested:
https://sharplab.io/#gist:67fcbace16c33703b6a6c5c3b59a58f2
Is it possible to make it reuse the registers like in the gist? I wonder whether you used a similar technique.
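
For reference, the 3x4 register-blocking idea described above can be sketched roughly as follows. This is a minimal scalar illustration in C, not the code from the gist or from the SharpLab link: the names `kernel_3x4` and `matmul_blocked` are hypothetical, the matrices are assumed to be row-major, and the dimensions are assumed to be multiples of the block size. In a SIMD version each scalar below would be a vector register (3 broadcast values of A, 1 loaded vector of B, 12 accumulators).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical micro-kernel: adds A(3 x k_len) * B(k_len x 4) into a 3x4
 * block of C. All matrices are row-major; lda/ldb/ldc are row strides. */
static void kernel_3x4(const float *a, const float *b, float *c,
                       size_t k_len, size_t lda, size_t ldb, size_t ldc)
{
    /* 3*4 accumulators that stay live for the whole k loop. */
    float acc[3][4] = {{0.0f}};

    for (size_t k = 0; k < k_len; ++k) {
        /* 3 values of A, loaded once per k and reused for all 4 columns. */
        const float a0 = a[0 * lda + k];
        const float a1 = a[1 * lda + k];
        const float a2 = a[2 * lda + k];

        for (size_t j = 0; j < 4; ++j) {
            /* 1 value of B, loaded once and used by all 3 rows. */
            const float bj = b[k * ldb + j];
            acc[0][j] += a0 * bj;
            acc[1][j] += a1 * bj;
            acc[2][j] += a2 * bj;
        }
    }

    /* Write the block back to memory only once, after the k loop. */
    for (size_t i = 0; i < 3; ++i)
        for (size_t j = 0; j < 4; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* C(m x n) += A(m x k) * B(k x n), assuming m % 3 == 0 and n % 4 == 0. */
void matmul_blocked(const float *a, const float *b, float *c,
                    size_t m, size_t n, size_t k)
{
    assert(m % 3 == 0 && n % 4 == 0);
    for (size_t i = 0; i < m; i += 3)
        for (size_t j = 0; j < n; j += 4)
            kernel_3x4(a + i * k, b + j, c + i * n + j, k, k, n, n);
}
```

Whether the accumulators actually stay in registers is up to the compiler or JIT, so checking the disassembly, as done above, is still the only way to confirm it.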