Mercury 2: a diffusion-based LLM up to 8x faster than existing reasoning models

Mercury 2 from Inception Labs is the first commercial reasoning model built on a diffusion approach. With an end-to-end latency of 1.7 seconds, it is up to 8x faster than existing reasoning models.

https://aisparkup.com/posts/9652

Wes Roth (@WesRoth)

Wes Roth reports that Inception Labs has launched Mercury 2. Rather than generating tokens one at a time sequentially, Mercury 2 is a "Diffusion LLM" that starts from noise and iteratively refines the entire sequence, reportedly achieving generation speeds of over 1,000 tokens per second, a shift that could change both the text-generation paradigm and processing speed.

https://x.com/WesRoth/status/2026703740577923507

#diffusionllm #mercury2 #inceptionlabs #llm #textgeneration
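The iterative-refinement loop described in the post can be sketched in a few lines of Python. This is a toy illustration, not Inception Labs' implementation: `toy_denoiser` is a hypothetical stand-in for a real denoising network, and the commit schedule (keep the most confident half each pass) is a simplification.

```python
import random

MASK = "<mask>"

def toy_denoiser(seq):
    # Hypothetical stand-in for a diffusion LM's denoising network:
    # returns a (token, confidence) guess for every masked position.
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (random.choice(vocab), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(length, steps):
    # Start from pure "noise": every position is masked.
    seq = [MASK] * length
    for _ in range(steps):
        preds = toy_denoiser(seq)
        if not preds:
            break
        # Commit the most confident predictions this pass; the rest
        # stay masked and are refined again on the next pass.
        keep = sorted(preds.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _) in keep[: max(1, len(keep) // 2)]:
            seq[i] = tok
    # Fill any positions still masked after the last refinement pass.
    for i, (tok, _) in toy_denoiser(seq).items():
        seq[i] = tok
    return seq

print(diffusion_decode(length=8, steps=4))
```

Because many positions are committed per pass instead of one token per forward call, the number of model invocations scales with the refinement-step count, not the sequence length, which is where the throughput claims come from.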


CDLM speeds up Diffusion LLM inference by up to 14x by solving two bottlenecks at once

CDLM, released by Together.ai, is a post-training technique that speeds up Diffusion Language Model inference by up to 14x. It tackles two bottlenecks at once: the KV-cache problem and an excessive number of refinement steps.

https://aisparkup.com/posts/9502
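One way to picture the block-wise KV-cache idea mentioned above: generate block by block, freeze the key/value activations of finished blocks, and recompute attention only for the active block at each refinement step. A minimal sketch under assumptions (`compute_kv` is a hypothetical stand-in for the model's attention projections; this is not CDLM's actual code):

```python
class BlockKVCache:
    # Toy block-wise KV cache for a block-by-block diffusion decoder.
    def __init__(self, block_size):
        self.block_size = block_size
        self.frozen = []  # one (keys, values) pair per finished block

    def compute_kv(self, tokens):
        # Hypothetical stand-in for the attention K/V projections.
        keys = [hash(t) % 97 for t in tokens]
        values = [hash(t) % 89 for t in tokens]
        return keys, values

    def step(self, finished_tokens, active_tokens):
        # Freeze K/V for any newly completed blocks so later
        # refinement steps never recompute them.
        while len(self.frozen) * self.block_size < len(finished_tokens):
            start = len(self.frozen) * self.block_size
            block = finished_tokens[start:start + self.block_size]
            self.frozen.append(self.compute_kv(block))
        # Only the active block is recomputed every refinement step.
        active_kv = self.compute_kv(active_tokens)
        return self.frozen, active_kv
```

Without such a cache, a bidirectional diffusion model must recompute attention over the entire sequence at every refinement step, which is one of the two bottlenecks the post describes.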

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
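The confidence-aware parallel decoding strategy in the abstract can be illustrated with a small sketch. This is a toy under assumptions, not the paper's implementation: it takes per-position token probabilities (here plain dicts standing in for softmax outputs) and commits in parallel only the tokens that clear a confidence threshold, deferring the rest to a later step.

```python
def confidence_parallel_step(probs_per_pos, threshold=0.9):
    # probs_per_pos: {position: {token: probability}} for masked positions.
    # Decode in parallel only tokens whose top probability clears the
    # threshold; low-confidence positions stay masked, avoiding the
    # dependency violations of naive fully-parallel decoding.
    accepted, deferred = {}, []
    for pos, probs in probs_per_pos.items():
        tok, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            accepted[pos] = tok
        else:
            deferred.append(pos)
    # Always make progress: if nothing clears the bar, commit the single
    # most confident position (greedy fallback).
    if not accepted and deferred:
        pos = max(deferred, key=lambda q: max(probs_per_pos[q].values()))
        tok, _ = max(probs_per_pos[pos].items(), key=lambda kv: kv[1])
        accepted[pos] = tok
        deferred.remove(pos)
    return accepted, deferred
```

High-confidence positions are exactly those where the conditional-independence assumption is least likely to bite, which is why thresholding preserves quality while still decoding many tokens per step.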

arXiv.org