this is such a good debugging story ("Rust std fs slower than Python!? No, it's hardware!") https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/

@b0rk Fascinating to think about what could've led to the underlying hw perf issue. Maybe AMD initially had a bug with page-aligned copying, and what we're seeing is the slow code with a mitigation/workaround?

@b0rk Excellent in-depth dive into how to investigate these kinds of issues! strace helped my team identify that Python was segfaulting because its UCS-4 string representation tried to allocate space for a >57 KiB regular expression.

(Switch the internal representation to UCS-8 and the problem goes away. Only happened in dev/test, 'cause only in dev/test do we enable every single route in the entire application, resulting in that monster regex. Thanks, Django.)

strace is love.
strace is life.
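Context on the UCS-4 part, for anyone following along: CPython (PEP 393) picks 1-, 2-, or 4-byte storage per string based on its widest character, so a single astral-plane character (an emoji, say) makes an otherwise-ASCII pattern roughly four times bigger in memory. A quick sketch of my own, not from that incident:

```python
import sys

# 1,000 ASCII characters: stored 1 byte per character (Latin-1 form)
ascii_s = "a" * 1000

# Same length, but one emoji forces the whole string into 4 bytes per character
ucs4_s = "a" * 999 + "\U0001F600"

print(sys.getsizeof(ucs4_s) > 3 * sys.getsizeof(ascii_s))  # → True
```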

@b0rk But this also touches on a peculiarity of Python I've mentioned before. There are edge cases where Python is legitimately faster than equivalent C code: memory pre-allocation (including array/list growth without a memmove), complex dead-code removal, and so forth.

Python cheats.
As much as it can possibly get away with.

We joke that Python's hash table implementation powering sets, dictionaries, &c. couldn't be optimized further without risking the creation of a singularity.

"Complex dead code removal" being, for example, the deletion of this entire function, which becomes a no-op.

def do():
    n = 0
    for i in range(27*10**42):
        n += 2

n is never used, so the loop is irrelevant and just goes away. (PyPy's JIT to blame for this one.)

C… would iterate 27 tredecillion times to do nothing.
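For contrast, you can see that plain CPython does *not* do this elimination (it's PyPy's JIT at work above) by inspecting the bytecode with the standard dis module — a minimal sketch of mine, with a smaller exponent so the function could actually run:

```python
import dis

def do():
    n = 0
    for i in range(27 * 10**6):  # scaled down from 10**42 for illustration
        n += 2

# CPython compiles the loop verbatim: FOR_ITER appears in the bytecode,
# so the interpreter really would iterate. PyPy removes it at runtime.
opnames = [ins.opname for ins in dis.get_instructions(do)]
print("FOR_ITER" in opnames)  # → True
```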

@alice C compilers (when optimisation is turned on) can be even smarter, e.g. https://godbolt.org/z/Y7q8PE485 shows GCC turning something similar to your loop into just a constant return. Clang can do even crazier stuff like turning sum over integers into n(n+1)/2 (slightly different to prevent overflow): https://godbolt.org/z/PPKsn1Y8q. More similar stuff here: https://www.youtube.com/watch?v=bSkpMdDe4g4
Compiler Explorer - C (x86-64 gcc (trunk))

int do_stuff(void) {
    int n = 0;
    for (int i = 0; i < (1 << 29); i += 3) {
        n += 2;
    }
    return n;
}
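The n(n+1)/2 rewrite Clang performs is just Gauss's closed form for a triangular sum; here's a quick check of the equivalence (my own illustration, not taken from the linked Godbolt examples):

```python
# Clang can rewrite a loop summing 1..n into the closed form n*(n+1)//2.
# Verify the loop and the closed form agree for a few values of n.
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n):
    return n * (n + 1) // 2

for n in (0, 1, 10, 1000):
    assert sum_loop(n) == sum_closed_form(n)

print(sum_closed_form(1000))  # → 500500
```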

@Smoljaguar Indeed; compiler tech (especially LLVM-derived stuff) has come a long way since the original "PyPy faster than C (for a carefully crafted example)" articles. I'm old. 😜 "Constant expression" optimization was a fun thing to dabble with in my own esolang. (In my case, helped by expression dependency graphing; if all of an expression's dependencies are known at compile time, that expression must be a constant.)
@b0rk It's an I/O-bound benchmark! Oh wait, in fact it's a memcpy() benchmark. Oh wait, in fact it's a REP MOVSB microbenchmark. Oh wait, it's a CPU FSRM (Fast Short REP MOV) microcode micromicrobenchmark! Great write-up.

@b0rk Yeah, but it still has me puzzled. The difference is in the *starting* offset of the operation only. Why would copying 4 KiB from offset 0x30 be faster than copying 4 KiB from offset 0x10?

I don't get it.
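A rough way to poke at the offset question from Python (purely illustrative; the timings depend entirely on the CPU and its microcode, and the harness here is my own guess at the scenario, not the post's actual benchmark):

```python
import time

PAGE = bytearray(64 * 1024 + 64)  # source buffer with slack for offset starts
SIZE = 4096                       # copy 4 KiB, as in the question

def copy_from(offset, reps=100_000):
    """Time repeatedly copying SIZE bytes starting at `offset`."""
    src = memoryview(PAGE)
    start = time.perf_counter()
    for _ in range(reps):
        dst = bytes(src[offset:offset + SIZE])
    return time.perf_counter() - start, dst

t_10, d_10 = copy_from(0x10)
t_30, d_30 = copy_from(0x30)

# The copied data is the same size either way; only the timing should
# differ, and only on CPUs affected by the FSRM/offset behavior.
print(len(d_10) == len(d_30) == SIZE)  # → True
```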

@dascandy42 @b0rk

The author, @xuanwo, reached the conclusion that there is a bug in the #AMD Ryzen 9 5900X.