Still can't understand the strange GPU memory alignment and read/write speed gacha game. When malloc() gives you an SSR pointer, the speed is 740 GB/s; otherwise it's 620 GB/s. I now suspect it's not just alignment but a memory channel/bank interleaving effect: depending on where the array is located, the number of DRAM channels/banks that get to interleave and participate in a transaction jumps up and down. Unfortunately, AMD does not have documentation for Vega 20.
@niconiconi faulty banks maybe?
@[email protected] I don't think there are faulty banks. If I process arrays A, B, and C in three separate compute kernels, each one always reaches peak performance. But if I process A, B, and C in the same kernel, I see this kind of performance variation, likely related to memory address generation patterns.

AMD's optimization manual has an "address offset vs channel interleave probability" table, but only for the ancient GCN 1.0 Radeon 7870 and 7970 GPUs. No information about newer generations like Vega.
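
To make the suspicion concrete, here is a toy model of channel interleaving. It is not Vega 20's real mapping (that is exactly what's undocumented): the 256-byte interleave granularity, the 32-channel count, the simple modulo mapping, and the names `channel_of` and `distinct_channels` are all assumptions for illustration only.

```python
# Toy model only: the real Vega 20 address-to-channel mapping is undocumented.
INTERLEAVE_BYTES = 256  # assumed interleave granularity
NUM_CHANNELS = 32       # assumed number of DRAM channels


def channel_of(addr):
    """Map a byte address to a channel using plain modulo interleaving (assumed)."""
    return (addr // INTERLEAVE_BYTES) % NUM_CHANNELS


def distinct_channels(bases, offset=0):
    """Count the channels hit when one kernel reads the same offset in every array."""
    return len({channel_of(base + offset) for base in bases})


# Bases a multiple of INTERLEAVE_BYTES * NUM_CHANNELS apart collide on one channel.
unlucky = [0, 8192, 16384]
# Bases staggered by one interleave unit spread across three channels.
lucky = [0, 256, 512]

print(distinct_channels(unlucky), distinct_channels(lucky))
```

In this model, three arrays streamed by one kernel either pile onto the same channel or spread across several, depending only on the base addresses malloc() happened to return. A kernel that touches a single array never sees this collision, which would be consistent with the separate-kernel case always hitting peak speed.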
@niconiconi well yes, the exact layout and interleaving should be in the documentation.