this is such a good debugging story ("Rust std fs slower than Python!? No, it's hardware!") https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/

@b0rk Fascinating to think about what could've led to the underlying hw perf issue. Maybe AMD initially had a bug with page-aligned copying, and what we're seeing is the slow code with a mitigation/workaround?

@b0rk Excellent in-depth dive into how to investigate these kinds of issues! strace helped my team identify that Python was segfaulting because its UCS-4 string representation tried to allocate space for a >57 KiB regular expression.

(Switch the internal representation to UCS-8 and the problem goes away. Only happened in dev/test, 'cause only in dev/test do we enable every single route in the entire application, resulting in that monster regex. Thanks, Django.)

strace is love.
strace is life.
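Context on the UCS-4 part, for anyone following along: CPython (PEP 393) picks 1-, 2-, or 4-byte storage per string based on its widest character, so a single astral-plane character (an emoji, say) makes an otherwise-ASCII pattern roughly four times bigger in memory. A quick sketch of my own, not from that incident:

```python
import sys

# 1,000 ASCII characters: stored 1 byte per character (Latin-1 form)
ascii_s = "a" * 1000

# Same length, but one emoji forces the whole string into 4 bytes per character
ucs4_s = "a" * 999 + "\U0001F600"

print(sys.getsizeof(ucs4_s) > 3 * sys.getsizeof(ascii_s))  # → True
```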

@b0rk But this also touches on a peculiarity of Python I've mentioned before. There are edge cases where Python is legitimately faster than equivalent C code: memory pre-allocation (including array/list growth without a memmove), complex dead-code removal, and so forth.

Python cheats.
As much as it can possibly get away with.

We joke that Python's hash table implementation powering sets, dictionaries, &c. couldn't be optimized further without risking the creation of a singularity.

"Complex dead code removal" being, for example, the deletion of this entire function, which becomes a no-op.

def do():
    n = 0
    for i in range(27*10**42):
        n += 2

n is never used, so the loop is irrelevant and just goes away. (PyPy's JIT to blame for this one.)

C… would iterate 27 tredecillion times to do nothing.
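For contrast, you can see that plain CPython does *not* do this elimination (it's PyPy's JIT at work above) by inspecting the bytecode with the standard dis module — a minimal sketch of mine, with a smaller exponent so the function could actually run:

```python
import dis

def do():
    n = 0
    for i in range(27 * 10**6):  # scaled down from 10**42 for illustration
        n += 2

# CPython compiles the loop verbatim: FOR_ITER appears in the bytecode,
# so the interpreter really would iterate. PyPy removes it at runtime.
opnames = [ins.opname for ins in dis.get_instructions(do)]
print("FOR_ITER" in opnames)  # → True
```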

@alice C compilers (when optimisation is turned on) can be even smarter, e.g. https://godbolt.org/z/Y7q8PE485 shows GCC turning something similar to your loop into just a constant return. Clang can do even crazier stuff like turning sum over integers into n(n+1)/2 (slightly different to prevent overflow): https://godbolt.org/z/PPKsn1Y8q. More similar stuff here: https://www.youtube.com/watch?v=bSkpMdDe4g4
Compiler Explorer - C (x86-64 gcc (trunk))

int do_stuff(void) {
    int n = 0;
    for (int i = 0; i < (1 << 29); i += 3) {
        n += 2;
    }
    return n;
}
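The n(n+1)/2 rewrite Clang performs is just Gauss's closed form for a triangular sum; here's a quick check of the equivalence (my own illustration, not taken from the linked Godbolt examples):

```python
# Clang can rewrite a loop summing 1..n into the closed form n*(n+1)//2.
# Verify the loop and the closed form agree for a few values of n.
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n):
    return n * (n + 1) // 2

for n in (0, 1, 10, 1000):
    assert sum_loop(n) == sum_closed_form(n)

print(sum_closed_form(1000))  # → 500500
```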

@Smoljaguar Indeed; compiler tech (especially LLVM-derived stuff) has come a long way since the original "PyPy faster than C (for a carefully crafted example)" articles. I'm old. 😜 "Constant expression" optimization was a fun thing to dabble with in my own esolang. (In my case, helped by expression dependency graphing; if all of an expression's dependencies are known at compile time, that expression must be a constant.)
@b0rk It's an I/O-bound benchmark! Oh wait, in fact it's a memcpy() benchmark. Oh wait, in fact it's a REP MOVSB microbenchmark. Oh wait, it's a CPU FSRM (Fast Short REP MOV) microcode micromicrobenchmark! Great write-up.

@b0rk Yeah, but it still has me puzzled. The difference is in the *starting* offset of the operation only. Why would copying 4 KiB from offset 0x30 be faster than copying 4 KiB from offset 0x10?

I don't get it.
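A rough way to poke at the offset question from Python (purely illustrative; the timings depend entirely on the CPU and its microcode, and the harness here is my own guess at the scenario, not the post's actual benchmark):

```python
import time

PAGE = bytearray(64 * 1024 + 64)  # source buffer with slack for offset starts
SIZE = 4096                       # copy 4 KiB, as in the question

def copy_from(offset, reps=100_000):
    """Time repeatedly copying SIZE bytes starting at `offset`."""
    src = memoryview(PAGE)
    start = time.perf_counter()
    for _ in range(reps):
        dst = bytes(src[offset:offset + SIZE])
    return time.perf_counter() - start, dst

t_10, d_10 = copy_from(0x10)
t_30, d_30 = copy_from(0x30)

# The copied data is the same size either way; only the timing should
# differ, and only on CPUs affected by the FSRM/offset behavior.
print(len(d_10) == len(d_30) == SIZE)  # → True
```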

@dascandy42 @b0rk

The author, @xuanwo, reached the conclusion that there is a bug in the #AMD Ryzen 9 5900X.