Friendly reminder to never read from write-combined memory (e.g. upload buffers) because that absolutely kills performance.

Note that you can run into this even if your C++ code is just a bunch of straight writes, if the compiler generates the wrong code. I ran into this with the D3D12_RAYTRACING_INSTANCE_DESC struct: because it contains bitfields, the compiler generated XOR and AND instructions with a memory destination. Each of those is a read-modify-write operation, so it reads from the write-combined memory. Probably relatively harmless on regular memory.

The solution was to fill a D3D12_RAYTRACING_INSTANCE_DESC stack variable and then memcpy() it in place. That actually also generated far more straightforward code that did what you'd expect. No actual stack was used; it just combined the data in registers and wrote to the destination with four straight movups writes.
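A minimal sketch of the two patterns, in case it helps. To keep it self-contained, `InstanceDesc` is a hypothetical stand-in that mirrors the field layout of D3D12_RAYTRACING_INSTANCE_DESC, and `WriteInPlace` / `WriteViaLocal` are made-up helper names; `dst` stands for a pointer into a mapped upload buffer. Whether the compiler really emits read-modify-write code for the first version, or keeps the second entirely in registers, depends on the compiler and flags, so check the disassembly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in with the same field layout as
// D3D12_RAYTRACING_INSTANCE_DESC (the real one comes from d3d12.h).
struct InstanceDesc {
    float    Transform[3][4];
    uint32_t InstanceID : 24;
    uint32_t InstanceMask : 8;
    uint32_t InstanceContributionToHitGroupIndex : 24;
    uint32_t Flags : 8;
    uint64_t AccelerationStructure;
};
static_assert(sizeof(InstanceDesc) == 64, "expected a 64-byte instance desc");

// Risky pattern: bitfield assignments go straight to the destination.
// The compiler may implement each one as a read-modify-write on *dst,
// which means reading from write-combined memory.
inline void WriteInPlace(InstanceDesc* dst, const float (&transform)[3][4],
                         uint32_t id, uint32_t mask, uint32_t hitGroup,
                         uint32_t flags, uint64_t blasAddress)
{
    assert(dst != nullptr);
    std::memcpy(dst->Transform, transform, sizeof(dst->Transform));
    dst->InstanceID = id;
    dst->InstanceMask = mask;
    dst->InstanceContributionToHitGroupIndex = hitGroup;
    dst->Flags = flags;
    dst->AccelerationStructure = blasAddress;
}

// Safer pattern: build the struct in a local, then copy it out in one go.
// The destination is only ever written, never read.
inline void WriteViaLocal(void* dst, const float (&transform)[3][4],
                          uint32_t id, uint32_t mask, uint32_t hitGroup,
                          uint32_t flags, uint64_t blasAddress)
{
    assert(dst != nullptr);
    InstanceDesc desc = {};
    std::memcpy(desc.Transform, transform, sizeof(desc.Transform));
    desc.InstanceID = id;
    desc.InstanceMask = mask;
    desc.InstanceContributionToHitGroupIndex = hitGroup;
    desc.Flags = flags;
    desc.AccelerationStructure = blasAddress;
    std::memcpy(dst, &desc, sizeof(desc));  // write-only access to dst
}
```

Both functions produce the same bytes; the difference is only in how the destination memory gets accessed along the way.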

The result of this was that this 1.3ms pass now takes 0.1ms, which is closer to what I would've expected.

@Humus ah memories of write gather pipes and command buffer filling.
@Humus Thanks for sharing 🙂