Google has introduced Multi-Token Prediction (MTP) drafters in Gemma 4. A lightweight drafter guesses several tokens ahead and the target model verifies them in parallel, speeding up inference by up to 3x while preserving output quality and reasoning (a minimal sketch of the loop follows the link preview below). It is compatible with LiteRT-LM, MLX, vLLM, Hugging Face, and more, and is released under Apache 2.0 with open weights.

https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

#ai #gemma4 #speculativedecoding #modeloptimization

Accelerating Gemma 4: faster inference with multi-token prediction drafters

An overview of how Multi-Token Prediction (MTP) drafters are making Gemma 4 models up to 3x faster at inference.

Google
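
The draft-and-verify loop described above is easy to state concretely. Below is a minimal greedy sketch, not Gemma 4's actual implementation: `draft_next` and `target_next` are hypothetical stand-ins for the drafter and target models, and the per-position target calls emulate what is really a single parallel verify pass.

```python
# Minimal greedy draft-and-verify loop, in the spirit of MTP speculative
# decoding. `draft_next` and `target_next` are hypothetical stand-ins for
# the drafter and target models, not Gemma 4's actual API.

def speculative_step(target_next, draft_next, prefix, k=4):
    """Draft k tokens cheaply, keep the longest prefix the target agrees with."""
    # 1) The drafter proposes k tokens autoregressively (cheap).
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2) The target checks every drafted position; the first mismatch is
    #    replaced by the target's own token and the rest are discarded.
    accepted, ctx = [], list(prefix)
    for t in drafted:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All k drafts accepted: the same verify pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted  # always >= 1 token per target pass, often more

# Toy demo: both models "count", so every draft is accepted plus a bonus.
print(speculative_step(lambda ts: len(ts), lambda ts: len(ts), [1, 2, 3]))
# -> [3, 4, 5, 6, 7]
```

Because the output always matches what the target would have produced greedily, the speedup comes purely from how often the cheap drafts are accepted.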

Llama and Spec: MTP Support

Support for MTP (Multi-Token Prediction) heads has been added to the llama.cpp project. The feature was tested on Qwen3.6 models and shows roughly a 2x or greater speedup over the baseline. The MTP model keeps its own separate context and cache, and the change includes hooks to address issues related to speculative decoding (see the bookkeeping sketch below). Benchmarks show a large improvement in token throughput, which benefits real-time applications. This update is a meaningful step for performance optimization in open-source LLM runtimes.

https://github.com/ggml-org/llama.cpp/pull/22673

#llama #mtp #speculativedecoding #qwen #llm

llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp

Overview This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B but in principle it should work for any MTP model. I've posted the detaile...

GitHub
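
The PR notes that the MTP path keeps its own context and cache, separate from the target model's. The sketch below illustrates only that bookkeeping idea, with hypothetical names; it is not llama.cpp's actual API or data structures.

```python
from dataclasses import dataclass, field

# Illustrative bookkeeping only: the MTP head keeps a KV cache separate
# from the target's, and after each verify step both must be rolled back
# to the prefix that was actually accepted. Names are hypothetical.

@dataclass
class KVCache:
    n_tokens: int = 0
    def append(self, n): self.n_tokens += n
    def truncate(self, n): self.n_tokens = min(self.n_tokens, n)

@dataclass
class MTPState:
    target_cache: KVCache = field(default_factory=KVCache)
    draft_cache: KVCache = field(default_factory=KVCache)  # separate cache

    def commit(self, n_drafted, n_accepted):
        # Both caches grew by the drafted block during draft + verify...
        self.target_cache.append(n_drafted)
        self.draft_cache.append(n_drafted)
        # ...then both are truncated back to the accepted prefix so the
        # two contexts stay consistent for the next round.
        keep = self.target_cache.n_tokens - (n_drafted - n_accepted)
        self.target_cache.truncate(keep)
        self.draft_cache.truncate(keep)
```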

merve (@mervenoyann)

Gemma 4 now ships with an MTP drafter: speculative decoding yields up to 3x tokens/sec over the baseline. Inference results are identical, just faster, with day-0 support in transformers, MLX, and vLLM, under the Apache 2.0 license.

https://x.com/mervenoyann/status/2051702372339003841

#gemma #speculativedecoding #vllm #mlx #transformers

merve (@mervenoyann) on X

Gemma 4 just got a massive speed-up with MTP drafters ⚡️ > speculative decoding (up to 3x tokens/sec improvement compared to normal Gemma-4 🔥) > identical reasoning, just faster > day-0 support in transformers, MLX, vLLM > A2.0 licensed 🤗

X (formerly Twitter)
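
In the transformers library, speculative decoding is exposed through assisted generation, i.e. the `assistant_model` argument to `generate()`. That argument is a real transformers API; the Gemma 4 repo ids below are placeholders, since the tweet does not name the exact checkpoints.

```python
# Speculative decoding via Hugging Face transformers "assisted generation".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-4"           # placeholder repo id
drafter_id = "google/gemma-4-drafter"  # placeholder repo id

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    drafter_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Explain speculative decoding in one sentence.",
             return_tensors="pt").to(target.device)
# The drafter proposes candidate tokens; the target verifies them in parallel.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```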

Google for Developers (@googledevs)

Google announced research results showing that applying Diffusion-Style Speculative Decoding (DFlash) to reduce the autoregressive bottleneck in LLM inference achieved a large 3.13x speedup on Google Cloud TPUs.

https://x.com/googledevs/status/2051406513097396607

#llm #inference #speculativedecoding #tpu #dflash

Google for Developers (@googledevs) on X

Breaking LLM inference’s autoregressive bottleneck 🛠️ We've teamed up with @haozhangml, @YimingBob, and @aaronzhfeng, among others from UCSD to achieve a massive 3.13X speedup for LLM inference on Google Cloud TPUs using Diffusion-Style Speculative Decoding (DFlash). Read the

X (formerly Twitter)
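
The post does not spell out DFlash's internals, but the general appeal of diffusion-style drafting is that the drafter fills a whole block of future positions in one non-autoregressive pass instead of looping token by token. A toy illustration of that contrast only, not DFlash itself:

```python
# Toy contrast with token-by-token drafting: a "diffusion-style" drafter
# fills a block of masked slots in a single non-autoregressive pass, so
# draft cost barely grows with block size. Illustrative only; DFlash's
# actual method is described in the linked work.

MASK = -1

def parallel_draft(draft_block, prefix, k=4):
    # One call fills all k masked positions at once.
    return draft_block(prefix, [MASK] * k)

def verify(target_next, prefix, block):
    # Same greedy verification as any speculative decoder.
    accepted, ctx = [], list(prefix)
    for t in block:
        expected = target_next(ctx)
        accepted.append(expected)
        if expected != t:
            break
        ctx.append(t)
    return accepted

# Demo with toy "counting" models: the whole drafted block is accepted.
draft_block = lambda ctx, slots: [len(ctx) + i for i in range(len(slots))]
print(verify(lambda ts: len(ts), [1, 2, 3],
             parallel_draft(draft_block, [1, 2, 3])))
# -> [3, 4, 5, 6]
```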

Hugging Models (@HuggingModels)

z-lab/Qwen3.6-27B-DFlash, which combines a Qwen3 base with new diffusion-based speculative decoding, has been introduced. The model uses flash decoding to speed up text generation and improve efficiency, and it is drawing attention in the AI community.

https://x.com/HuggingModels/status/2049772758771646701

#qwen3 #speculativedecoding #textgeneration #diffusion #flashdecoding

Hugging Models (@HuggingModels) on X

Imagine a model that combines the power of Qwen3 with a new diffusion-based speculative decoding. That's z-lab/Qwen3.6-27B-DFlash. It's a text-generation transformer that uses flash decoding for speed and efficiency. The AI community is buzzing about this one.

X (formerly Twitter)

AISatoshi (@AiXsatoshi)

AISatoshi shared speculative decoding acceleration test results for Mistral's Mistral-Medium-3.5-128B-EAGLE. In the coding session shown in the video, the acceptance rate was around 25-30%. MoE models are fast, but a dense model paired with a dedicated speculative-decoding draft model is noted as a useful combination too.

https://x.com/AiXsatoshi/status/2049543302530355622

#mistral #speculativedecoding #llm #aigeneration #moe

AI✖️Satoshi⏩️ (@AiXsatoshi) on X

mistralai/Mistral-Medium-3.5-128B-EAGLE speed-up test with speculative decoding. Acceptance rate around 25-30% on the coding task in the video. Fast! MoE is nice and fast, but a dense model plus a dedicated speculative-decoding model is also good.

X (formerly Twitter)
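
Reading the quoted 25-30% as a per-token acceptance probability (the tweet does not define it precisely), the standard expected-accepted-length formula from the speculative sampling literature gives a feel for the ceiling. The numbers below are illustrative, not AISatoshi's measurements.

```python
# Expected tokens committed per verify pass, assuming an i.i.d. per-token
# acceptance probability alpha and draft length k (standard formula from
# Leviathan et al., 2023):  E = (1 - alpha**(k + 1)) / (1 - alpha)
def expected_tokens(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.25, 0.30):
    print([round(expected_tokens(alpha, k), 2) for k in (2, 4, 8)])
# alpha=0.25 -> [1.31, 1.33, 1.33]; alpha=0.30 -> [1.39, 1.43, 1.43]
```

Under this reading the expected gain per pass stays below ~1.5 tokens, so the net wall-clock win also depends on how cheap the drafter and the parallel verify pass are.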

Deedy (@deedydas)

Deedy highlighted a blog post that dramatically boosts LLM inference performance: by moving speculative decoding onto two 2GB SRAM/chip Corsairs, a small cost on top of a standard GPU setup, the authors cut latency by 10x and achieved over 1,400 tokens/sec, drawing attention as a gpt-oss-120b inference optimization case study.

https://x.com/deedydas/status/2040083405841568115

#llm #inference #optimization #speculativedecoding #gpu

Deedy (@deedydas) on X

This is the best blog post on LLM inference I've seen this year. They achieved 10x latency and >1400 tokens/sec by moving speculative decode onto two 2GB SRAM/chip Corsairs, a small cost on top of a standard GPU setup on gpt-oss-120b. This performance at this price is insane.

X (formerly Twitter)

fly51fly (@fly51fly)

The paper 'Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning', by authors from Microsoft Research Asia and Peking University, proposes a reinforcement-learning-driven adaptive approach to speculative decoding (arXiv, 2026). It presents a new methodology for improving decoding speed and quality.

https://x.com/fly51fly/status/2028956988995190960

#speculativedecoding #reinforcementlearning #llm #research

fly51fly (@fly51fly) on X

[CL] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning J Zhang, Z Yu, L Wang, N Yang… [Microsoft Research Asia & Peking University] (2026) https://t.co/OTxG6Fydal

X (formerly Twitter)
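
The paper's exact RL formulation is not described in the post, so the sketch below is purely an illustration of what adaptive drafting can look like: a toy epsilon-greedy bandit that picks the draft length maximizing observed accepted-tokens-per-second.

```python
import random

# Toy epsilon-greedy bandit over draft lengths: one simple shape that
# "learning to draft" could take, with decoding throughput as the reward.
# Not the paper's method; an illustration of adaptive drafting only.

class DraftLengthBandit:
    def __init__(self, ks=(1, 2, 4, 8), eps=0.1):
        self.ks, self.eps = ks, eps
        self.value = {k: 0.0 for k in ks}  # running reward estimate per arm
        self.count = {k: 0 for k in ks}

    def choose(self):
        if random.random() < self.eps:  # explore occasionally
            return random.choice(self.ks)
        return max(self.ks, key=self.value.__getitem__)  # else exploit

    def update(self, k, accepted_tokens, seconds):
        reward = accepted_tokens / seconds  # accepted tokens per second
        self.count[k] += 1
        self.value[k] += (reward - self.value[k]) / self.count[k]
```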

New research shows how speculative decoding trains a draft model to guess tokens, then verifies them with the main LLM, cutting compute and boosting token generation speed. The approach promises big gains in model efficiency and opens doors for open-source AI training. Dive into the details! #SpeculativeDecoding #TokenGeneration #ModelEfficiency #OpenSourceAI

🔗 https://aidailypost.com/news/speculative-decoding-trains-drafter-guess-verify-llm-outputs
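
The guess-then-verify step has a standard exact-sampling form (Leviathan et al. 2023, Chen et al. 2023): accept a drafted token x with probability min(1, p(x)/q(x)) and, on rejection, resample from the normalized residual max(p - q, 0), which keeps outputs distributed exactly as the target model's. A minimal sketch, independent of the article's specific training recipe:

```python
import random

# Standard speculative sampling acceptance rule: accept drafted token x
# with prob min(1, p[x]/q[x]); on rejection, resample from the residual
# max(p - q, 0), renormalized. Output matches the target distribution.

def accept_or_resample(x, p, q):
    """x: drafted token id; p, q: target/draft probability dicts."""
    if random.random() < min(1.0, p.get(x, 0.0) / max(q.get(x, 0.0), 1e-12)):
        return x, True  # draft accepted
    residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    r, acc = random.random() * z, 0.0
    for t, w in residual.items():  # inverse-CDF sample from the residual
        acc += w
        if r <= acc:
            return t, False
    return x, False  # numerical fallback
```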

Researchers have discovered a clever trick: by embedding a mask token directly into the weight matrix, they can bypass the costly embedding lookup and generate up to three times faster token streams. The method works with parallel computation and speculative decoding, promising big gains for open-source LLMs. Read on to see how ConfAdapt powers this speed-up. #LLMinference #SpeculativeDecoding #MultiTokenPrediction #ModelAcceleration

🔗 https://aidailypost.com/news/researchers-embed-mask-token-llm-weights-achieve-3-faster-inference
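
The post is light on detail; one plausible reading of "embedding a mask token directly into the weight matrix" is appending a learned [MASK] row to the input embedding so that several masked future positions can be fed in a single forward pass, MTP-style. A hedged torch sketch of that reading, not ConfAdapt's confirmed mechanism:

```python
import torch

# Hedged sketch: append a learned [MASK] row to an existing embedding
# weight matrix so k mask slots can follow the prompt, letting a single
# forward pass draft k future positions. ConfAdapt's actual mechanism
# may well differ.

vocab, dim, k = 32000, 4096, 4
emb = torch.nn.Embedding(vocab, dim)

# New weight matrix with one extra row: id `vocab` becomes [MASK].
w = torch.cat([emb.weight.data, torch.zeros(1, dim)], dim=0)
emb_with_mask = torch.nn.Embedding.from_pretrained(w, freeze=False)
MASK_ID = vocab

prompt = torch.tensor([[1, 42, 7]])
slots = torch.full((1, k), MASK_ID)
hidden = emb_with_mask(torch.cat([prompt, slots], dim=1))
print(hidden.shape)  # (1, 3 + k, dim): prompt plus k draftable positions
```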