@aeva well, suppose we're working in 32-thread waves internally (totally hypothetical number)
now those 32 invocations get (in the very first thread group) x=0,...,15 for y=0 and then x=0,...,15 again for y=1.
Say the image is R8G8B8A8 pixels. The internal image layout stores aligned groups of 4 texels next to each other and then goes to the next y, and the next 4-wide strip of texels is actually stored something like 256 bytes away or whatever.
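
To make that concrete, here's a rough sketch of where the 32 invocations of that first wave would land in memory. Everything in it is made up for illustration: the function names are hypothetical, the 256-byte strip stride is just the "or whatever" number from above, and real hardware layouts differ. It also only makes sense for y < 16, since a 256-byte strip of 16-byte rows holds 16 rows.

```python
BYTES_PER_TEXEL = 4    # R8G8B8A8
STRIP_WIDTH = 4        # texels in one aligned group
STRIP_STRIDE = 256     # assumed bytes between neighboring 4-wide strips
ROW_BYTES = STRIP_WIDTH * BYTES_PER_TEXEL  # 16 bytes per strip row


def texel_offset(x, y):
    """Byte offset of texel (x, y) in the hypothetical strip layout."""
    strip, within = divmod(x, STRIP_WIDTH)
    return strip * STRIP_STRIDE + y * ROW_BYTES + within * BYTES_PER_TEXEL


def wave_offsets(wave_size=32, group_width=16):
    """Offsets touched by one wave: x runs 0..15 for y=0, then for y=1."""
    return [texel_offset(i % group_width, i // group_width)
            for i in range(wave_size)]


def contiguous_runs(offsets):
    """Merge byte offsets into (start, length_in_bytes) runs."""
    runs = []
    for off in sorted(offsets):
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += BYTES_PER_TEXEL   # extends the current run
        else:
            runs.append([off, BYTES_PER_TEXEL])  # starts a new run
    return [tuple(r) for r in runs]


if __name__ == "__main__":
    for start, length in contiguous_runs(wave_offsets()):
        print(f"{length} bytes at offset {start}")
    # prints four 32-byte runs, at offsets 0, 256, 512 and 768
```

So with these made-up numbers, one wave reading "two rows of 16 pixels" is really touching four separate 32-byte chunks that sit 256 bytes apart, not one contiguous 128-byte block.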

