What's that mysterious workaround?

Core Huff6 decode step is described in https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/

A customer managed to get a fairly consistent repro for transient decode errors by overclocking an i7-14700KF by about 5% from stock settings ("performance" multiplier 56->59).

It took weeks of back and forth and forensic debugging to figure out what actually happens, but TL;DR: the observed decode errors are all consistent with a single instruction misbehaving.

This instruction:
mov [rDest + <index>], ch

under these conditions, when overclocked a bit and once the machine has "warmed up", seems to have roughly a 1-in-10,000 chance of actually storing the contents of CL instead of CH to memory.

(this was "fun" to debug.)

The workaround: when we detect Raptor Lake CPUs, we now do

shr ecx, 8
mov [rDest + <index>], cl

instead. This costs more front-end (FE) and uop bandwidth, but the loop is mainly latency-limited and the extra work is off the critical path.

@rygorous
Or don't overclock?
But definitely mad props on the detective work.
@brouhaha @rygorous Fabian can clarify further but I thought it did happen without overclocking, just with extreme rarity? The overclocking is used to make it reproducible in a lab setting.

@pervognsen @brouhaha The semi-consistent repro that one customer managed (I still have never seen it live) uses mild overclocking.

The symptoms match what non-overclocked machines see when they crash; whether it's actually the same underlying issue, we simply do not know.

@rygorous @pervognsen @brouhaha "That's my secret Cap. I'm always overclocked."
@rygorous @pervognsen @brouhaha But yeah, the line between "overclocked" and "binned" has been fuzzy for many years. I think the only novel thing here is the faster degradation over time, which means that parts become mis-binned a lot quicker.
@TomF @rygorous @pervognsen @brouhaha Tempted to underclock my next system. Correct is way better than slightly faster and wrong.

@jef @TomF @pervognsen @brouhaha Unlikely to help if you're on one of the older microcode revisions, FWIW. The changes they've been making have something to do with the frequency-boosting behavior, and if you underclock the CPU, that just gives it even more thermal and power headroom and might actually make the problem worse.

Maybe if you did that and also turned off turbo boost entirely, but at that point you're grossly overpaying for the CPU power you're actually getting.

@jef @TomF @pervognsen @brouhaha My recommendation (not speaking for my employer or anyone else here) is to either get Intel 12th gen (Alder Lake), the newer ones (Arrow Lake), or one of the AMD Zen CPUs. Just avoid Intel's 13th/14th gen CPUs entirely. They might be a lot better now, but the fact that they're still regularly putting out new uCode revisions trying to work around this tells me that there's something fundamentally broken with that entire generation.
@TomF @rygorous @pervognsen @brouhaha it's also complicated by the fact that we now have at least three clock scaling states, typically four or more, with complex scaling interactions determined by outside factors like thermals, VRM power targets, adjacent core load, and config values set by the motherboard, not to mention production variances in the hardware. it's practically a recipe for nasty faults where a particular load pattern coincides with a particular transient hardware state.