What's that mysterious workaround?

Core Huff6 decode step is described in https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/

A customer managed to get a fairly consistent repro for transient decode errors by overclocking an i7-14700KF by about 5% from stock settings ("performance" multiplier 56->59).

It took weeks of back and forth and forensic debugging to figure out what actually happens, but TL;DR: the observed decode errors are all consistent with a single instruction misbehaving.

This instruction:
mov [rDest + <index>], ch

under these conditions, when overclocked a bit, once the machine has "warmed up", seems to have around a 1/10000 chance of actually storing the contents of CL instead of CH to memory.

(this was "fun" to debug.)

The workaround: when we detect Raptor Lake CPUs, we now do

shr ecx, 8
mov [rDest + <index>], cl

instead. This takes more FE and uop bandwidth, but this loop is mainly latency-limited, and this is off the critical path.
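To make the difference between the two store forms concrete, here is a minimal C model of what each one is supposed to write. This is only an illustration of the intended semantics, not the actual decoder code; the function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of `mov [rDest + index], ch`:
   CH is bits 8..15 of ECX, stored as a single byte. */
void store_ch(uint8_t *dest, size_t index, uint32_t ecx) {
    uint8_t ch = (uint8_t)((ecx & 0xFF00u) >> 8);
    dest[index] = ch;
}

/* Hypothetical model of the workaround:
   `shr ecx, 8` followed by `mov [rDest + index], cl`.
   Returns the shifted register value, because unlike the CH store
   this form modifies ECX. */
uint32_t store_shifted_cl(uint8_t *dest, size_t index, uint32_t ecx) {
    ecx >>= 8;                   /* old CH is now in CL */
    dest[index] = (uint8_t)ecx;  /* store the low byte */
    return ecx;
}
```

With ECX = 0x00AB1234, both forms should write 0x12. The failure described above corresponds to the first form writing 0x34 (the old CL) instead.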

@rygorous "when we detect Raptor Lake CPUs, " -- do you run different code at different CPU models? how high up is the split (if not a secret)

@msinilo We don't have many of those.

Generally it's just broad feature levels like BMI2 vs. not.

The Huffman decoders have a lot of special cases though:
- AMD Jaguar 64-bit (for PS4/Xb1)
- AMD Zen 2+ 64-bit (for PS5/XSX, but also used on desktop Zen CPUs)
- BMI2 x86 64-bit
- generic x86 64-bit
- x86 32-bit with SSE2
- pre-SSE2 x86 32-bit

and that's just x86, ARM has more (4 Huffman decoder kernels: Cortex-A55 class, Cortex-A57/A72 class, Cortex-A78+ class, M1+-class)
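For the model-specific case (the Raptor Lake check), a rough sketch of how such detection can work: decode family and model from the EAX value returned by CPUID leaf 1, then compare against known model numbers. This is an assumption about how one could do it, not a description of Oodle's actual check. The names `CpuSignature`, `decode_cpuid_signature` and `is_raptor_lake` are hypothetical, and the model numbers (0xB7, 0xBA, 0xBF) are the Raptor Lake entries as listed in the Linux kernel's Intel family table.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t family;
    uint32_t model;
    uint32_t stepping;
} CpuSignature;

/* Decode the processor signature from CPUID leaf 1 EAX.
   The extended model field only applies to base family 6 or 15,
   the extended family field only to base family 15. */
CpuSignature decode_cpuid_signature(uint32_t eax) {
    CpuSignature sig;
    uint32_t base_family = (eax >> 8) & 0xF;
    uint32_t base_model  = (eax >> 4) & 0xF;
    uint32_t ext_model   = (eax >> 16) & 0xF;
    uint32_t ext_family  = (eax >> 20) & 0xFF;

    sig.stepping = eax & 0xF;
    sig.family = base_family;
    sig.model = base_model;
    if (base_family == 0xF)
        sig.family += ext_family;
    if (base_family == 0x6 || base_family == 0xF)
        sig.model |= ext_model << 4;
    return sig;
}

/* Assumed model list; a real check would also confirm the vendor
   string is "GenuineIntel" before trusting family/model. */
bool is_raptor_lake(CpuSignature sig) {
    if (sig.family != 6)
        return false;
    return sig.model == 0xB7 || sig.model == 0xBA || sig.model == 0xBF;
}
```

For example, a signature of 0x000B0671 decodes to family 6, model 0xB7, stepping 1, which this sketch classifies as Raptor Lake.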

@msinilo it is just those loops and they're fundamentally dead simple, they're just decoding Huffman-coded bytes using an 11-bit length-limited code in a particular stream layout.

Having this be standalone and not interleaved with other processing lets us specialize a lot more than we otherwise would.
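To illustrate what "decoding Huffman-coded bytes using an 11-bit length-limited code" means in general, here is a minimal single-stream, table-driven decoder sketch in C. It is not the Oodle stream layout or kernel: the MSB-first bit order, the canonical code assignment, and all names (`TableEntry`, `build_table`, `peek_bits`, `decode_bytes`) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CODE_LEN 11
#define TABLE_SIZE (1u << MAX_CODE_LEN)

typedef struct {
    uint8_t sym;  /* decoded byte */
    uint8_t len;  /* code length in bits, 1..11 */
} TableEntry;

/* Fill the lookup table from per-symbol code lengths (0 = unused).
   Codes are assigned canonically: shorter codes first, symbol order
   within a length. Assumes the lengths form a complete prefix code,
   so every one of the 2048 slots ends up filled. */
void build_table(TableEntry *table, const uint8_t *lens, int num_syms) {
    uint32_t next = 0;  /* next free table slot */
    for (int len = 1; len <= MAX_CODE_LEN; len++) {
        for (int s = 0; s < num_syms; s++) {
            if (lens[s] != len)
                continue;
            uint32_t span = 1u << (MAX_CODE_LEN - len);
            for (uint32_t i = 0; i < span; i++) {
                table[next + i].sym = (uint8_t)s;
                table[next + i].len = (uint8_t)len;
            }
            next += span;
        }
    }
}

/* Return the next 11 bits at bit_pos (MSB-first), zero-padded past the end. */
uint32_t peek_bits(const uint8_t *src, size_t src_len, size_t bit_pos) {
    size_t byte = bit_pos >> 3;
    uint32_t v = 0;
    for (int i = 0; i < 3; i++) {
        v <<= 8;
        if (byte + i < src_len)
            v |= src[byte + i];
    }
    v = (v << (bit_pos & 7)) & 0xFFFFFFu;  /* drop already-consumed bits */
    return v >> (24 - MAX_CODE_LEN);
}

/* Decode `count` bytes; returns how many were actually decoded. */
size_t decode_bytes(uint8_t *dest, size_t count, const uint8_t *src,
                    size_t src_len, const TableEntry *table) {
    size_t bit_pos = 0;
    for (size_t i = 0; i < count; i++) {
        TableEntry e = table[peek_bits(src, src_len, bit_pos)];
        if (e.len == 0)
            return i;       /* unfilled slot: invalid table or stream */
        dest[i] = e.sym;    /* one table lookup yields one output byte */
        bit_pos += e.len;   /* consume only the bits this code used */
    }
    return count;
}
```

Capping code lengths at 11 bits is what keeps the table at 2048 entries, so every symbol decodes with a single lookup and no fallback path. A production kernel would keep the bit buffer in a register and refill it in bulk rather than re-reading bytes per symbol as `peek_bits` does here.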

@rygorous Which BMI2 instructions do you use?
@camelcdr just SHRX. 1 uop for variable-distance shift on Haswell and up vs. 3 uops on most Intel CPUs in that range for "SHR reg, cl".
@camelcdr if you have an unrolled 4-instruction loop per byte, and one of those four instructions is a variable shift, that shift taking 1/3rd the backend resources is very noticeable.