Hey, finally! I just published a new - and real - blog post "10x Performance with SIMD Vectorized Code in C#/.NET" https://xoofx.com/blog/2023/07/09/10x-performance-with-simd-in-csharp-dotnet/ 🎉

That was a quick write-up, so my apologies for the poor phrasing. After 3 years without writing a blog post, I feel rusty. But it feels good to share again! 🤗 #dotnet


@xoofx great post, this is basically Sep 👍 Is there a reason for doing permute *before* pack unsigned saturate?

@nietras oh, good catch! 🙂 No reason, I think I missed the fact that I could use _mm256_permute4x64_epi64 afterwards instead of performing the swap before. It saves 3 permutes in the end, not bad! Thanks for the suggestion.
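
For readers following along, here is a minimal sketch of the "pack first, permute once afterwards" idea. It is not the code from the blog post or from Sep. The names `PackThenPermute` and `PackInOrder` are hypothetical, and it assumes the inputs are two vectors of 16-bit lanes. In .NET, `_mm256_permute4x64_epi64` is exposed as `Avx2.Permute4x64`.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
static class PackThenPermute
{
    // Narrows 2 x 16 lanes of 16 bits to 32 bytes, kept in source order.
    // Callers are expected to check Avx2.IsSupported first.
    public static Vector256<byte> PackInOrder(Vector256<ushort> a, Vector256<ushort> b)
    {
        // The 256-bit pack works per 128-bit lane, so the 64-bit quads
        // of the result come out as [a.lo, b.lo, a.hi, b.hi].
        // Inputs are treated as signed 16-bit and saturated to 0..255.
        Vector256<byte> packed = Avx2.PackUnsignedSaturate(a.AsInt16(), b.AsInt16());

        // One cross-lane permute picks quads 0, 2, 1, 3,
        // which restores [a.lo, a.hi, b.lo, b.hi].
        return Avx2.Permute4x64(packed.AsInt64(), 0b11_01_10_00).AsByte();
    }
}
```

Swapping before the pack needs a permute on each input, while swapping after needs a single permute on the packed result, which is presumably where the saving comes from.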

I have updated the blog post and added a link to Sep at the end of the article. That's actually a good example of real-world usage of intrinsics for performance benefits! 😉

@xoofx thanks ☺️
@xoofx of course the generic versions can be improved too, given there is ExtractMostSignificantBits (generic move mask), and I'm not sure pack saturated is needed, can just Narrow. So it should actually be possible to make this fully generic. I think 😅
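
A rough sketch of what such a fully generic version might look like, using only the cross-platform `Vector128` API (.NET 7 or later). This is an assumption about the shape of the code, not what the blog post or Sep does. `GenericMask` and `MatchMask` are hypothetical names, and the example assumes the search is over `int` values.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
static class GenericMask
{
    // Returns a 16-bit mask: bit i is set when the i-th of the 16 ints
    // (v0 first, v3 last) equals value.
    public static uint MatchMask(
        Vector128<int> v0, Vector128<int> v1,
        Vector128<int> v2, Vector128<int> v3, int value)
    {
        Vector128<int> needle = Vector128.Create(value);

        // Each compare lane is all ones (match) or all zeros (no match).
        Vector128<int> m0 = Vector128.Equals(v0, needle);
        Vector128<int> m1 = Vector128.Equals(v1, needle);
        Vector128<int> m2 = Vector128.Equals(v2, needle);
        Vector128<int> m3 = Vector128.Equals(v3, needle);

        // Narrow truncates instead of saturating. That is safe here because
        // truncating an all-ones or all-zeros lane keeps it all ones or zeros.
        Vector128<short> s0 = Vector128.Narrow(m0, m1);
        Vector128<short> s1 = Vector128.Narrow(m2, m3);
        Vector128<sbyte> bytes = Vector128.Narrow(s0, s1);

        // Generic move mask: one bit per byte lane, lowest lane in bit 0.
        return bytes.ExtractMostSignificantBits();
    }
}
```

If the four vectors hold 10 through 25 and `value` is 19, only bit 9 is set, so `BitOperations.TrailingZeroCount(mask)` gives the index 9. Narrow keeps the lanes in order, so no permute is needed on this path.
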
@nietras yeah, it definitely could be. I took the original code without digging further. Anyway, I'm back to my holidays, I won't check that until next week! 😎🏖️