What's that mysterious workaround?

Core Huff6 decode step is described in https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/

A customer managed to get a fairly consistent repro for transient decode errors by overclocking an i7-14700KF by about 5% from stock settings ("performance" multiplier 56->59).

It took weeks of back and forth and forensic debugging to figure out what actually happens, but TL;DR: the observed decode errors are all consistent with a single instruction misbehaving.

Entropy decoding in Oodle Data: x86-64 6-stream Huffman decoders

It’s been a while! Last time, I went over how the 3-stream Huffman decoders in Oodle Data work. The 3-stream layout is what we originally went with. It gives near-ideal performance on the las…

The ryg blog

This instruction:
mov [rDest + <index>], ch

under these conditions, when overclocked a bit, once the machine has "warmed up", seems to have around a 1/10000 chance of actually storing the contents of CL instead of CH to memory.

(this was "fun" to debug.)

The workaround: when we detect Raptor Lake CPUs, we now do

shr ecx, 8
mov [rDest + <index>], cl

instead. This takes more FE and uop bandwidth, but this loop is mainly latency-limited, and this is off the critical path.

@rygorous is this related to that "Short Loops Which Use AH/BH/CH/DH Registers May Cause Unpredictable System Behavior" erratum from years ago or is this an entirely separate issue that again involves the *H registers?
@rygorous ah, nevermind, since yours is in Raptor Lake it has to be unrelated...