@pkhuong Looks neat, not sure I can really understand it well tho :P.
I'll just stick to the Bakery Algorithm :D.
@pkhuong I can't follow all of it, but I wonder if this bit:
> It can also be helpful to use cache line-wide stores (e.g., AVX-512 or FSRM stores) to avoid reads for ownership.
should be clarified? First, wouldn't a full-line write need a "no data RFO" in any case? Also, do you know whether any CPUs implement this optimization for AVX-512? As I recall, Skylake didn't use rfo_nodata for aligned AVX-512 writes, and I haven't heard of that changing since.
@amonakov Yeah, you're right, I'll clarify that it avoids a full RFO: the chip still needs to send a read for ownership, but it doesn't need the line's data back.
Re the AVX-512 path, I'm pretty sure it works on SPR+ (Sapphire Rapids and later), but it can be hard to confirm with all the prefetchers.
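For illustration, here is a minimal C sketch of what a "cache line-wide store" looks like in practice. It assumes GCC or Clang on x86-64 and a 64-byte line: a single aligned 64-byte `_mm512_store_si512` covers the whole line, with a `memcpy` fallback when AVX-512F isn't available. The names `line_t`, `store_line`, and `store_line_avx512` are hypothetical. As discussed above, the core still issues a request for ownership either way, and whether it skips fetching the line's data depends on the microarchitecture (reportedly not on Skylake, possibly on SPR and later). This only shapes the store. It does not guarantee the optimization.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_64 1
#endif

/* Assumed cache line size: 64 bytes on current x86-64 parts. */
#define CACHE_LINE 64

/* Hypothetical record type that occupies exactly one aligned cache line. */
typedef struct {
    alignas(CACHE_LINE) uint64_t words[CACHE_LINE / sizeof(uint64_t)];
} line_t;

#ifdef HAVE_X86_64
/* Only this function is compiled for AVX-512F; callers must check CPU support first. */
__attribute__((target("avx512f")))
static void store_line_avx512(line_t *dst, const line_t *src)
{
    __m512i v = _mm512_load_si512((const void *)src->words);
    /* One aligned 64-byte store writes every byte of the destination line. */
    _mm512_store_si512((void *)dst->words, v);
}
#endif

/* Copy one full cache line from src to dst. */
void store_line(line_t *dst, const line_t *src)
{
    /* Both pointers must be line-aligned for the full-line store. */
    assert(((uintptr_t)dst % CACHE_LINE) == 0);
    assert(((uintptr_t)src % CACHE_LINE) == 0);
#ifdef HAVE_X86_64
    if (__builtin_cpu_supports("avx512f")) {
        store_line_avx512(dst, src);
        return;
    }
#endif
    /* Fallback: the compiler may split this into several narrower stores. */
    memcpy(dst, src, sizeof *dst);
}
```

The runtime `__builtin_cpu_supports` check keeps the sketch safe on CPUs without AVX-512F, where it simply falls back to `memcpy`.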