We have an upcoming paper at ICML 2025.
📄Paper: https://www.eecg.utoronto.ca/~mcj/papers/2025.alspec.icml.pdf
⚙️Code: https://github.com/mcj-group/alspec
This gist: When a large language model (LLM) responds to a prompt by generating text (inference), one of the most important but slowest stages of the computation is called attention. One way to speed up attention is to use more compute devices (e.g., multiple GPUs or accelerators) working collaboratively on it (tensor parallelism). Unfortunately, when scaling to 8 or more devices, the communication between them overwhelms the benefit of their increased compute. Another way to speed up attention is to approximate its underlying math. Prior approximate-attention approaches work well for some inference tasks, but on others, like solving math problems, the quality of the generated text is poor.

We propose attention-level speculation, a technique that combines and enhances the multi-device and approximate approaches to speed up LLM inference. Attention-level speculation sometimes uses the output of the attention approximation and sometimes does not, verifying on the fly whether the approximation was of good quality. Using two devices, we overlap this verification with speculative downstream computation. Speculation succeeds for up to 90% of attention operations. Our experiments with Tenstorrent N150s suggest that combining attention-level speculation with tensor parallelism across 8 devices is up to 1.65x faster than using tensor parallelism alone on 8 devices.
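The speculate-then-verify idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the top-k attention approximation, the tolerance-based acceptance rule, and the `downstream` stand-in are all assumptions made for the sketch, and the two "devices" run sequentially here rather than in parallel.

```python
import numpy as np

def exact_attention(q, K, V):
    # Standard softmax attention for a single query vector q over keys K and values V.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def approx_attention(q, K, V, k=8):
    # Hypothetical cheap approximation: attend only to the top-k keys by score.
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argsort(scores)[-k:]
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ V[idx]

def speculative_attention(q, K, V, downstream, tol=1e-2, k=8):
    # Speculative path: run the cheap approximation and immediately
    # continue with the downstream computation (in the paper, this
    # overlaps with verification on a second device).
    approx_out = approx_attention(q, K, V, k)
    spec_result = downstream(approx_out)

    # Verification path: compute exact attention.
    exact_out = exact_attention(q, K, V)

    # Accept the speculation if the approximation was close enough.
    if np.linalg.norm(approx_out - exact_out) <= tol * np.linalg.norm(exact_out):
        return spec_result, True
    # Misspeculation: redo the downstream work with the exact output.
    return downstream(exact_out), False
```

On acceptance the speculative downstream result is kept for free; on rejection the cost is one redundant downstream pass, which is why a high speculation success rate is what makes the technique pay off.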
This project was led by Jack Cai, a recent BASc grad from University of Toronto Engineering and a former Tenstorrent intern. Jack pitched this as an undergraduate thesis project, and I did not think it would work. Shame on me, and amazing work and persistence by Jack. Many thanks to our co-authors Ammar Vora, Randalph Zhang, and Mark O'Connor.