https://godbolt.org/z/b66vhKb1f
Is there anything actually stopping compilers from optimizing wide_load1 into a single wide load? As I understand the C++ memory model, a relaxed atomic load should be reorderable with the surrounding non-atomic loads.

Compiler Explorer - C++
#include <atomic>
#include <cstdint>

uint8_t arr[4];
std::atomic<uint32_t> a1, a2;

// Byte loads interrupted by a relaxed atomic load whose result is discarded.
uint32_t wide_load1() {
    uint32_t ret = 0;
    ret |= arr[3] << 24;
    a1.load(std::memory_order_relaxed);
    ret |= arr[2] << 16;
    ret |= arr[1] << 8;
    ret |= arr[0] << 0;
    return ret;
}

// Same byte loads with no atomic access in between.
uint32_t wide_load2() {
    uint32_t ret = 0;
    ret |= arr[3] << 24;
    ret |= arr[2] << 16;
    ret |= arr[1] << 8;
    ret |= arr[0] << 0;
    return ret;
}
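
To make the question concrete, here is a rough sketch of the reordering I would expect to be allowed: the relaxed load is moved in front of the four byte loads, which leaves them adjacent and mergeable the same way as in wide_load2. The name wide_load1_hoisted is made up for illustration, and whether a given compiler actually performs this transformation is exactly what I am unsure about.

```cpp
#include <atomic>
#include <cstdint>

// Same globals as in the snippet above.
uint8_t arr[4];
std::atomic<uint32_t> a1;

// Hypothetical hand-reordered variant of wide_load1 (name is an assumption):
// the relaxed load comes first, so the byte loads are contiguous.
uint32_t wide_load1_hoisted() {
    a1.load(std::memory_order_relaxed);  // result discarded, as in wide_load1

    uint32_t ret = 0;
    // Casts keep the shifts in unsigned arithmetic; the combined value is
    // the same as in the original for every input.
    ret |= static_cast<uint32_t>(arr[3]) << 24;
    ret |= static_cast<uint32_t>(arr[2]) << 16;
    ret |= static_cast<uint32_t>(arr[1]) << 8;
    ret |= static_cast<uint32_t>(arr[0]) << 0;
    return ret;
}
```

In a single-threaded run this returns the same value as wide_load1, so the only open point is whether the memory model or compiler conservatism forbids the move.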