Speculative decode is an inference optimization that was mentioned a few times at #GTC26. I'd heard of it but didn't know how it worked, so I spent some time figuring it out. Some notes (and toy code that illustrates its benefits!) are here: glennklockwood.com/garden/specu... #AI

speculative decode

Speculative decode is an inference optimization where you use a small draft model to generate a sequence of output tokens, then run those draft tokens through a full-sized model as a single multi-token decode to check which of them the full-sized model would also have produced.
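
To make the draft-then-verify idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not the toy code from the linked notes: `target_model` and `draft_model` are hypothetical stand-ins that map a token list to one greedy next token, acceptance is a plain greedy match (real implementations often use a probabilistic acceptance rule instead), and the verification loop only simulates what a real full-sized model would do in one batched forward pass.

```python
from typing import List, Tuple


def target_model(ctx: List[int]) -> int:
    """Hypothetical full-sized model: greedy next token for a context."""
    return (ctx[-1] * 3 + 1) % 10


def draft_model(ctx: List[int]) -> int:
    """Hypothetical small draft model: matches the target except after token 3."""
    if ctx[-1] == 3:
        return 9  # deliberate disagreement so that rejection is exercised
    return (ctx[-1] * 3 + 1) % 10


def speculative_step(ctx: List[int], k: int) -> List[int]:
    """Draft k tokens, then verify them against the target model."""
    # 1. The draft model proposes k tokens, one at a time (cheap).
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_model(ctx + draft))

    # 2. The target model checks every draft position. A real system would
    #    do this as one multi-token decode; this loop only simulates it.
    accepted: List[int] = []
    for i in range(k):
        expected = target_model(ctx + draft[:i])
        if draft[i] != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(draft[i])

    # 3. All k drafts matched, so the same pass also yields one extra token.
    accepted.append(target_model(ctx + draft))
    return accepted


def generate_speculative(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n tokens; also return how many target-model passes were used."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        out.extend(speculative_step(out, k))
        passes += 1
    return out[len(prompt):len(prompt) + n], passes


def generate_baseline(prompt: List[int], n: int) -> List[int]:
    """Ordinary decoding: one target-model pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_model(out))
    return out[len(prompt):]


if __name__ == "__main__":
    tokens, passes = generate_speculative([1], 12)
    print("tokens:", tokens)
    print("target passes:", passes, "vs baseline passes:", 12)
```

In this sketch the speculative output is identical to what the target model would produce on its own, because every emitted token is either confirmed or supplied by the target. The saving comes from needing fewer target-model passes than tokens generated, and it shrinks as the draft model disagrees more often.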

Glenn's Digital Garden