Fabian Giesen May 21, 2025

What's that mysterious workaround?

The core Huff6 decode step is described in https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/

A customer managed to get a fairly consistent repro for transient decode errors by overclocking an i7-14700KF by about 5% from stock settings ("performance" multiplier 56->59).

It took weeks of back and forth and forensic debugging to figure out what actually happens, but TL;DR: the observed decode errors are all consistent with a single instruction misbehaving.

Fabian Giesen May 21, 2025

This instruction:
mov [rDest + <index>], ch

under these conditions, when overclocked a bit, once the machine has "warmed up", seems to have around a 1/10000 chance of actually storing the contents of CL instead of CH to memory.

(this was "fun" to debug.)

The workaround: when we detect Raptor Lake CPUs, we now do

shr ecx, 8
mov [rDest + <index>], cl

instead. This takes more FE and uop bandwidth, but this loop is mainly latency-limited, and this is off the critical path.
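
As a rough illustration of why the two sequences are interchangeable, here is a minimal C model of the stores. It is only a sketch: the function names (`store_ch`, `store_shifted_cl`, `store_cl_faulty`) are made up for this example and are not taken from Oodle, and the real decoder is hand-written x86-64 assembly. `store_cl_faulty` models the misbehavior described above, where the low byte lands in memory instead of the high one.

```c
#include <stddef.h>
#include <stdint.h>

/* Models `mov [rDest + index], ch`: CH is bits 8..15 of ECX. */
static void store_ch(uint8_t *dest, size_t index, uint32_t ecx) {
    dest[index] = (uint8_t)((ecx >> 8) & 0xFFu);
}

/* Models the workaround: `shr ecx, 8` followed by
 * `mov [rDest + index], cl`. The shift moves the old CH bits into CL,
 * so the same byte is stored. Unlike the original instruction, ECX is
 * modified; the shifted value is returned to make that visible. */
static uint32_t store_shifted_cl(uint8_t *dest, size_t index, uint32_t ecx) {
    ecx >>= 8;
    dest[index] = (uint8_t)(ecx & 0xFFu);
    return ecx;
}

/* Models the observed failure (hypothetical helper for illustration):
 * the store writes CL, bits 0..7, where CH was requested. */
static void store_cl_faulty(uint8_t *dest, size_t index, uint32_t ecx) {
    dest[index] = (uint8_t)(ecx & 0xFFu);
}
```

With `ecx = 0x1234ABCD`, both `store_ch` and `store_shifted_cl` write `0xAB`, while the faulty variant writes `0xCD`. A single wrong output byte of that kind matches the symptom described above.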

🇺🇦 haxadecimal 🚫👑 May 21, 2025

@rygorous
Or don't overclock?
But definitely mad props on the detective work.

Per Vognsen May 22, 2025

@brouhaha @rygorous Fabian can clarify further but I thought it did happen without overclocking, just with extreme rarity? The overclocking is used to make it reproducible in a lab setting.

Fabian Giesen May 22, 2025

@pervognsen @brouhaha The semi-consistent repro that one customer managed (I still have never seen it live) uses mild overclocking.

The symptoms are the same as on non-overclocked machines seeing the crashes; whether it's actually the same thing, we simply do not know.

Per Vognsen May 22, 2025

@rygorous @brouhaha A timing closure issue at only 5% overclock definitely seems like the kind of thing that could trigger without overclocking given manufacturing process variance, temperature and random noise.

🇺🇦 haxadecimal 🚫👑

@pervognsen @rygorous
Yes, having only 5% margin on binning seems crazy to me, though when I worked for a fabless semiconductor company, I was not involved with timing closure. There were rumors that Intel cut back a lot on verification, but this suggests problems even before that stage.

Aaron Sawdey, Ph.D. May 22, 2025

@brouhaha @pervognsen @rygorous I would say this is a timing or physical design miss .. what I know as “verification” is more of “does the logic work correctly if you just simulate the logical operation of the circuits”. If the timing is this close that a 5% overclock hits it .. you’re going to hit it at nominal frequency because of dynamic voltage sag and such.

Fabian Giesen May 22, 2025

@acsawdey @brouhaha @pervognsen It's already confirmed that the underlying root cause is physical circuit degradation of the clock tree over time.

https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239

These machines start out with larger margins and then degrade over time, with clock skew getting worse and worse until stuff starts breaking.

Aaron Sawdey, Ph.D. May 22, 2025

@rygorous @brouhaha @pervognsen well there you go. That’s a process vs operating temperature and voltage problem most likely.

MarkAtMicrochip May 23, 2025

@acsawdey @rygorous @brouhaha @pervognsen I had a college professor say something to the effect of: “any IC design will work at a particular frequency, temperature, and voltage. You’re aiming for above 1Hz, above absolute zero, and below the breakdown voltage.”

Aaron Sawdey, Ph.D. May 23, 2025

@MarkAtMicrochip @rygorous @brouhaha @pervognsen 😂 yes .. but the problem comes when you want that operating point to produce performance better than your competitor’s product.

Nicolás Alvarez May 26, 2025

@acsawdey @rygorous @brouhaha @pervognsen AFAIK, even if you stop overclocking and apply all the microcode and motherboard BIOS updates recommended by Intel, the bad operating conditions have already damaged the circuitry; if it was misbehaving, it will keep misbehaving. You can only stop it from getting worse.

Fabian Giesen May 26, 2025

@nicolas17 You don't need to have ever overclocked for that to happen, FWIW. They do that all on their own; the overclocking here is just part of getting a more consistent repro.