@aeva so, x=0,...,3 y=0 are all good, these are all adjacent, straight shot, read 16 consecutive bytes, great.
x=0,...,3 y=1 in threads 16..19 are also good, these are the next 16 bytes in memory.
But if we have 256-byte cache lines (another Totally Hypothetical Number), well, those 32 bytes are all we get out of that cache line. The other 224 bytes get fetched and never used.
x=4,...,7 for y=0 and 1 are in the cache line at offset 256, x=8,...,11 for y=0,1 at offset 512, x=12,...,15 at offset 768.
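
To make the arithmetic concrete, here is a rough sketch that tallies how many bytes of each cache line the wave actually touches. The layout is an assumption reverse-engineered from the numbers above, not a real GPU's format: 4-byte texels, tiles 4 texels wide with one tile per 256-byte cache line, and a 32-thread wave arranged 16x2 so that thread `t` handles `x = t % 16`, `y = t // 16`. The names `texel_offset` and `bytes_used_per_line` are made up for illustration.

```python
from collections import defaultdict

# All of these are hypothetical numbers matching the example, not real hardware.
BYTES_PER_TEXEL = 4                    # assumed: 4 texels = 16 bytes
TILE_W = 4                             # assumed: tiles are 4 texels wide
CACHE_LINE = 256                       # the Totally Hypothetical cache line size
ROW_BYTES = TILE_W * BYTES_PER_TEXEL   # 16 bytes per tile row
WAVE_W, WAVE_H = 16, 2                 # assumed: 32 threads laid out 16x2


def texel_offset(x, y):
    """Byte offset of texel (x, y) in the assumed tiled layout.

    Only meant for the first row of tiles (y small enough that one
    tile column still fits in a single cache line).
    """
    tile = x // TILE_W                 # which 4-wide tile column
    return tile * CACHE_LINE + y * ROW_BYTES + (x % TILE_W) * BYTES_PER_TEXEL


def bytes_used_per_line():
    """Map each touched cache line's base offset to the bytes the wave reads."""
    used = defaultdict(int)
    for t in range(WAVE_W * WAVE_H):
        x, y = t % WAVE_W, t // WAVE_W
        line_base = texel_offset(x, y) // CACHE_LINE * CACHE_LINE
        used[line_base] += BYTES_PER_TEXEL
    return dict(used)


if __name__ == "__main__":
    for line_base, used in sorted(bytes_used_per_line().items()):
        print(f"cache line at offset {line_base}: {used} of {CACHE_LINE} bytes used")
```

Under these assumptions the wave touches four cache lines (offsets 0, 256, 512, 768) and uses 32 bytes of each, so 128 useful bytes out of 1024 fetched.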

