Fabian Giesen May 21, 2025

What's that mysterious workaround?

The core Huff6 decode step is described in https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/

A customer managed to get a fairly consistent repro for transient decode errors by overclocking an i7-14700KF by about 5% from stock settings ("performance" multiplier 56->59).

It took weeks of back and forth and forensic debugging to figure out what actually happens, but TL;DR: the observed decode errors are all consistent with a single instruction misbehaving.

Entropy decoding in Oodle Data: x86-64 6-stream Huffman decoders

It’s been a while! Last time, I went over how the 3-stream Huffman decoders in Oodle Data work. The 3-stream layout is what we originally went with. It gives near-ideal performance on the las…

The ryg blog

This instruction:
mov [rDest + <index>], ch

under these conditions, when overclocked a bit, once the machine has "warmed up", seems to have around a 1/10000 chance of actually storing the contents of CL instead of CH to memory.

(this was "fun" to debug.)

The workaround: when we detect Raptor Lake CPUs, we now do

shr ecx, 8
mov [rDest + <index>], cl

instead. This takes more FE and uop bandwidth, but this loop is mainly latency-limited, and this is off the critical path.
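
As a rough C model of the two store variants (a sketch, not the actual Oodle code): `store_high_byte_direct` mirrors `mov [rDest + <index>], ch`, `store_high_byte_shifted` mirrors the `shr ecx, 8` / `mov [rDest + <index>], cl` workaround, and `store_low_byte_faulty` models the observed misbehavior. All function names are made up for illustration.

```c
#include <stdint.h>

/* Model of `mov [dest], ch`: store bits [15:8] of the register. */
static void store_high_byte_direct(uint8_t *dest, uint32_t ecx)
{
    *dest = (uint8_t)((ecx >> 8) & 0xFF);
}

/* Model of the workaround: `shr ecx, 8` followed by `mov [dest], cl`.
   The shifted register is returned to make the side effect visible:
   the old CL value is gone afterwards, so anything that needs it has
   to run before this point. */
static uint32_t store_high_byte_shifted(uint8_t *dest, uint32_t ecx)
{
    ecx >>= 8;                       /* old CH is now in CL */
    *dest = (uint8_t)(ecx & 0xFF);   /* plain low-byte store */
    return ecx;
}

/* Model of the observed fault: bits [7:0] (CL) get stored instead of
   bits [15:8] (CH). */
static void store_low_byte_faulty(uint8_t *dest, uint32_t ecx)
{
    *dest = (uint8_t)(ecx & 0xFF);
}
```

With a register value of 0x4105 (high byte 0x41, low byte 0x05), both correct variants store 0x41, while the faulty model stores 0x05.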

Fabian Giesen May 21, 2025

In the Huffman decoding loops, this is a fairly minor perf impact on Raptor Lake (0.1-0.3% slower over the full Kraken decode, usually). It sucks on old (pre-Haswell) parts but they take a different code path already.

We also use TANS in some cases (rarely Kraken, more often in Leviathan) and in those kernels the extra shift does hurt. We've seen up to 20% slow-down from the extra insn per byte in pathological cases but in practice it's usually more like 0.5% for typical data.

Jim Kjellin May 21, 2025

@rygorous why not replicate value to ensure cl=ch?
didn't work or slower?

Fabian Giesen May 21, 2025

@jimk Values are from a table. CL and CH contain different values!

Fabian Giesen May 21, 2025

@jimk CL contains the Huffman code length (needs to be in CL for legacy x86 because "shr reg, cl" is the only variable shift) and CH is the corresponding value
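
A simplified C sketch of a table-driven decode step with that entry layout: the low byte of each 16-bit table entry is the code length (what ends up in CL), the high byte is the decoded value (what ends up in CH). The LSB-first bit order, the missing bit-buffer refill, and the names `make_entry`, `add_code` and `decode_bytes` are assumptions for illustration, not the real Oodle kernels.

```c
#include <stddef.h>
#include <stdint.h>

enum { PEEK_BITS = 11, TABLE_SIZE = 1 << PEEK_BITS };

/* Entry layout: bits [7:0] = code length, bits [15:8] = symbol. */
static uint16_t make_entry(uint8_t symbol, uint8_t length)
{
    return (uint16_t)(((uint16_t)symbol << 8) | length);
}

/* Register one prefix code (LSB-first): every 11-bit index whose low
   `length` bits equal `code` maps to the same entry. */
static void add_code(uint16_t *table, uint32_t code, uint8_t length,
                     uint8_t symbol)
{
    for (uint32_t i = code; i < TABLE_SIZE; i += 1u << length)
        table[i] = make_entry(symbol, length);
}

/* Decode `count` bytes from a bit buffer that already holds all the
   bits needed (no refill in this sketch). */
static void decode_bytes(const uint16_t *table, uint64_t bits,
                         uint8_t *dest, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t entry = table[bits & (TABLE_SIZE - 1)]; /* 16-bit load */
        bits >>= (entry & 0xFF);          /* "CL": consume `length` bits */
        dest[i] = (uint8_t)(entry >> 8);  /* "CH": store the symbol */
    }
}
```

If the store in the last line wrote the low byte instead, the output would contain a code length (a small number, at most 11) where a decoded byte should be.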

Shane Celis May 21, 2025

@rygorous Wow. That’s some bug.

Maciej Sinilo May 21, 2025

@rygorous "when we detect Raptor Lake CPUs, " -- do you run different code at different CPU models? how high up is the split (if not a secret)

Fabian Giesen May 21, 2025

@msinilo yes, and right before the loops in question.

Our CPU detect puts out a generic feature flag for "this is Raptor Lake" and the dispatcher then does

if (rrCPUx86_feature_present(RRX86_CPU_BMI2))
{
    if (rrCPUx86_feature_present(RRX86_CPU_RAPTOR_LAKE))
    {
        ok = tansx2_x64_bmi2_rpl_asm(&s);
    }
    else
    {
        ok = tansx2_x64_bmi2_asm(&s);
    }
}
else
{
    ok = tansx2_x64_asm(&s);
}
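
How the "this is Raptor Lake" flag gets computed is not shown here. One plausible way, sketched below, is to decode the family/model fields from CPUID leaf 1 EAX. The model list (0xB7, 0xBA, 0xBF) is recalled from the Linux kernel's Intel family table and is an assumption, as are the names `decode_cpuid_signature` and `looks_like_raptor_lake`; the real detection may check something different.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t family;
    uint32_t model;
    uint32_t stepping;
} CpuSignature;

/* Decode the CPUID.01H:EAX signature into display family/model. */
static CpuSignature decode_cpuid_signature(uint32_t eax)
{
    CpuSignature sig;
    uint32_t base_family = (eax >> 8) & 0xF;
    uint32_t base_model  = (eax >> 4) & 0xF;
    uint32_t ext_family  = (eax >> 20) & 0xFF;
    uint32_t ext_model   = (eax >> 16) & 0xF;

    sig.stepping = eax & 0xF;
    /* The extended family only applies when the base family is 0xF. */
    sig.family = (base_family == 0xF) ? base_family + ext_family
                                      : base_family;
    /* The extended model applies for base family 6 or 0xF. */
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? ((ext_model << 4) | base_model)
                    : base_model;
    return sig;
}

/* Assumes the caller has already checked for the "GenuineIntel" vendor
   string. The model numbers are an assumption, see the note above. */
static bool looks_like_raptor_lake(uint32_t eax)
{
    CpuSignature sig = decode_cpuid_signature(eax);
    if (sig.family != 6)
        return false;
    return sig.model == 0xB7 || sig.model == 0xBA || sig.model == 0xBF;
}
```

For example, a signature of 0xB0671 decodes to family 6, model 0xB7, stepping 1 and is flagged, while 0x90672 (model 0x97) is not.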

Petr Tesařík May 26, 2025

@rygorous @msinilo The conditional on each execution is annoying me (it takes up slots in the branch predictor). Have you considered patching code at initialization time?

Fabian Giesen May 27, 2025

@ptesarik @msinilo it's completely irrelevant

Fabian Giesen May 21, 2025

@msinilo We don't have many of those.

Generally it's just broad feature levels like the BMI2 vs. not.

The Huffman decoders have a lot of special cases though:
- AMD Jaguar 64-bit (for PS4/Xb1)
- AMD Zen 2+ 64-bit (for PS5/XSX, but also used on desktop Zen CPUs)
- BMI x86 64-bit
- generic x86 64-bit
- x86 32-bit with SSE2
- pre-SSE2 x86 32-bit

and that's just x86, ARM has more (4 Huffman decoder kernels: Cortex-A55 class, Cortex-A57/A72 class, Cortex-A78+ class, M1+-class)

Fabian Giesen May 21, 2025

@msinilo it is just those loops and they're fundamentally dead simple, they're just decoding Huffman-coded bytes using an 11-bit length-limited code in a particular stream layout.

Having this be standalone and not interleaved with other processing lets us specialize a lot more than we otherwise would.

camelcdr May 21, 2025

@rygorous Which BMI2 instructions do you use?

Fabian Giesen May 21, 2025

@camelcdr just SHRX. 1 uop for variable-distance shift on Haswell and up vs. 3 uops on most Intel CPUs in that range for "SHR reg, cl".

Fabian Giesen May 21, 2025

@camelcdr if you have an unrolled 4-instruction loop per byte, and one of those four instructions is a variable shift, that shift taking 1/3rd the backend resources is very noticeable.
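
Back-of-the-envelope version of that, as a tiny C sketch. It assumes the other three instructions in the per-byte sequence cost one uop each, which is a simplification made here and not a measured figure; `uops_per_byte` is a made-up helper.

```c
enum { INSTRUCTIONS_PER_BYTE = 4 };

/* Total uops per decoded byte: one variable shift plus three other
   instructions, each assumed to cost `other_uops_each`. */
static int uops_per_byte(int shift_uops, int other_uops_each)
{
    return shift_uops + (INSTRUCTIONS_PER_BYTE - 1) * other_uops_each;
}
```

Under that assumption, `uops_per_byte(3, 1)` is 6 for "SHR reg, cl" and `uops_per_byte(1, 1)` is 4 for SHRX, so the per-byte uop count drops by a third.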

zwarich May 21, 2025

@rygorous now *that's* a false subregister dependency 👌

Fabian Giesen May 21, 2025

@zwarich this is likely served from load-to-store forwarding on the bypass network, our best guess is that there's a mux somewhere to select bits [15:8] instead of [7:0] from the source (the load itself is a 16-bit load) and its control signal is, apparently, timing critical. The underlying Vmin shift issue https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239 causes clock skew so any control signal with little timing slack is in the danger zone

Intel Core 13th and 14th Gen Desktop Instability Root Cause Update

Following extensive investigation of the Intel® Core™ 13th and 14th Gen desktop processor Vmin Shift Instability issue, Intel can now confirm the root cause diagnosis for the issue. This post will cover Intel’s understanding of the root cause, as well as additional mitigations and next steps for Int...

🇺🇦 haxadecimal 🚫👑 May 21, 2025

@rygorous
Or don't overclock?
But definitely mad props on the detective work.

Per Vognsen May 22, 2025

@brouhaha @rygorous Fabian can clarify further but I thought it did happen without overclocking, just with extreme rarity? The overclocking is used to make it reproducible in a lab setting.

Fabian Giesen May 22, 2025

@pervognsen @brouhaha The semi-consistent repro that one customer managed (I still have never seen it live) uses mild overclocking.

The symptoms are the same as on non-overclocked machines seeing the crashes, whether it's actually the same thing we simply do not know.

Per Vognsen May 22, 2025

@rygorous @brouhaha A timing closure issue at only 5% overclock definitely seems like the kind of thing that could trigger without overclocking given manufacturing process variance, temperature and random noise.

🇺🇦 haxadecimal 🚫👑 May 22, 2025

@pervognsen @rygorous
Yes, having only 5% margin on binning seems crazy to me, though when I worked for a fabless semiconductor company, I was not involved with timing closure. There were rumors that Intel cut back a lot on verification, but this suggests problems even before that stage.

Aaron Sawdey, Ph.D. May 22, 2025

@brouhaha @pervognsen @rygorous I would say this is a timing or physical design miss .. what I know as “verification” is more of “does the logic work correctly if you just simulate the logical operation of the circuits”. If the timing is this close that a 5% overclock hits it .. you’re going to hit it at nominal frequency because of dynamic voltage sag and such.

Fabian Giesen May 22, 2025

@acsawdey @brouhaha @pervognsen It's already confirmed that the underlying root cause is physical circuit degradation of the clock tree over time.

https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239

These machines start out with larger margins and then degrade over time getting worse and worse clock skew until stuff starts breaking.

Aaron Sawdey, Ph.D. May 22, 2025

@rygorous @brouhaha @pervognsen well there you go. That’s a process vs operating temperature and voltage problem most likely.

MarkAtMicrochip May 23, 2025

@acsawdey @rygorous @brouhaha @pervognsen I had a college professor say something to the effect of: “any IC design will work at a particular frequency, temperature, and voltage. You’re aiming for above 1Hz, above absolute zero, and below the breakdown voltage.”

Aaron Sawdey, Ph.D. May 23, 2025

@MarkAtMicrochip @rygorous @brouhaha @pervognsen 😂 yes .. but the problem comes when you want that operating point to produce performance better than your competitor’s product.

Nicolás Alvarez May 26, 2025

@acsawdey @rygorous @brouhaha @pervognsen afaik even if you stop overclocking and apply all the microcode and motherboard-BIOS updates recommended by Intel, the bad operating conditions already damaged the circuitry, if it was misbehaving it will keep misbehaving. You can only stop it from getting worse.

Fabian Giesen May 26, 2025

@nicolas17 you don't need to have ever overclocked for that to happen FWIW, they do that all on their own, the overclocking here is just part of getting a more consistent repro

Tom Forsyth May 22, 2025

@rygorous @pervognsen @brouhaha "That's my secret Cap. I'm always overclocked."

Tom Forsyth May 22, 2025

@rygorous @pervognsen @brouhaha But yeah, the line between "overclocked" and "binned" has been fuzzy for many years. I think the only novel thing here is the faster degradation over time, which means that parts become mis-binned a lot quicker.

Jef Poskanzer May 22, 2025

@TomF @rygorous @pervognsen @brouhaha Tempted to underclock my next system. Correct is way better than slightly faster and wrong.

Fabian Giesen May 23, 2025

@jef @TomF @pervognsen @brouhaha Unlikely to help, if you're on one of the older microcode revisions, FWIW. The changes they've been making have something to do with the frequency boosting behavior and if you underclock the CPU, that'll just give it even more thermal and power headroom and might actually make the problem worse.

Maybe if you did that and also turn off turbo boost entirely, but at that point you're grossly overpaying for the CPU power you're actually getting.

Fabian Giesen May 23, 2025

@jef @TomF @pervognsen @brouhaha My recommendation (not speaking for my employer or anyone else here) is to either get Intel 12th gen (Alder Lake), the newer ones (Arrow Lake), or one of the AMD Zen CPUs. Just avoid Intel's 13th/14th gen CPUs entirely. They might be a lot better now, but the fact that they're still regularly putting out new uCode revisions trying to work around this tells me that there's something fundamentally broken with that entire generation.

Graham Sutherland / Polynomial May 23, 2025

@TomF @rygorous @pervognsen @brouhaha it's also complicated by the fact that we now have at least three clock scaling states, typically four or more, with complex scaling interactions determined by outside factors like thermals, VRM power targets, adjacent core load, and config values set by the motherboard, not to mention production variances in the hardware. it's practically a recipe for nasty faults where a particular load pattern coincides with a particular transient hardware state.

🇺🇦 haxadecimal 🚫👑 May 22, 2025

@pervognsen @rygorous
Ah, that makes sense. If it happens without overclocking and overtemp, then Intel doesn't have sufficient margins in their binning.

Cassandrich May 22, 2025

@pervognsen @brouhaha @rygorous I mean I'd just take that as indication you need to underclock these pieces of shit by 20% or treat them as ewaste. 🤷

Very Human Robot May 22, 2025

@brouhaha @rygorous

Users who overclock won't return their CPUs to Intel, they'll just leave your game a bad review and ask for their money back.

Not "fair" but "true."

Jann Horn May 21, 2025

@rygorous is this related to that "Short Loops Which Use AH/BH/CH/DH Registers May Cause Unpredictable System Behavior" erratum from years ago or is this an entirely separate issue that again involves the *H registers?

Jann Horn May 21, 2025

@rygorous ah, nevermind, since yours is in Raptor Lake it has to be unrelated...

Fabian Giesen May 22, 2025

@jann Yeah, unrelated AFAICT

meta May 22, 2025

@rygorous yikes!!!

Very Human Robot May 22, 2025

@rygorous How about "actually, we don't support overclocking?"

(I get that customers will complain no matter what, so this is probably the optimal solution though ...)

Fabian Giesen May 22, 2025

@StompyRobot oh they do that without overclocking too, OC just makes for a more consistent repro

Kim Spence-Jones 🇬🇧😷 May 22, 2025

@rygorous @StompyRobot
😱

Fabian Giesen May 22, 2025

@KimSJ @StompyRobot to reiterate, this is a known HW problem that made the rounds last year and has been confirmed by Intel https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239 as a physical problem with the circuit that causes it to degrade over time. To some extent all chips have this kind of thing but normally this happens on a timescale of several years of intensive use, not weeks or months

190n May 22, 2025

@rygorous wow that's insane. Have you looked for other occurrences of this that you might need to change or is it only the one in a hot loop? Can it happen with other registers? Does it mix up cl and bpl if rex is used?

m_on_stair May 22, 2025

@rygorous
wau....

can you publish the repro?

Fabian Giesen May 22, 2025

@m I have still never seen that crash in person, or even gotten a live debug session.

This was all just via emails back and forth with a company on the other side of the world

Espresso macchiato May 22, 2025

Good times