
Prince Canuma (@Prince_Canuma)

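Results from an MLX implementation of TriAttention have been published. On Gemma-4-31B-IT in BF16, it reportedly achieves up to 81% KV cache compression at 60K tokens. Rather than quantizing KV values the way TurboQuant does, TriAttention removes low-importance tokens from the cache entirely.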
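To make the pruning idea concrete, here is a minimal pure-Python sketch of score-based KV cache eviction. It is not the TriAttention algorithm or the MLX implementation: the function names `prune_kv_cache` and `l2_norm_score` are hypothetical, and the L2-norm score is only a stand-in, since the post does not say here how keys are actually scored. It also assumes the 81% figure counts dropped tokens, in which case a 60,000-token cache would keep roughly 11,400 entries.

```python
import math


def l2_norm_score(key):
    """Placeholder importance score (an assumption, NOT TriAttention's scoring)."""
    return math.sqrt(sum(x * x for x in key))


def prune_kv_cache(keys, values, scores, n_keep):
    """Keep the n_keep highest-scoring tokens, preserving their original order."""
    if not (len(keys) == len(values) == len(scores)):
        raise ValueError("keys, values and scores must have the same length")
    if n_keep < 0:
        raise ValueError("n_keep must be non-negative")
    # Rank token positions by score, highest first, and take the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors by position so the cache stays in sequence order.
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Token budget under the stated assumption: 81% of 60,000 tokens dropped.
n_keep_60k = 60_000 * (100 - 81) // 100

# Toy cache with four tokens; keep the two with the largest key norms.
toy_keys = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_values = ["v0", "v1", "v2", "v3"]
toy_scores = [l2_norm_score(k) for k in toy_keys]
pruned_keys, pruned_values, kept_idx = prune_kv_cache(
    toy_keys, toy_values, toy_scores, n_keep=2
)
```

The contrast with quantization is that every surviving key and value stays at full precision, and the saving comes from storing fewer positions rather than fewer bits per position.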

https://x.com/Prince_Canuma/status/2042021304270819394

#mlx #triaattention #gemma #kvcache #quantization

Prince Canuma (@Prince_Canuma) on X

Just implemented TriAttention in MLX and the results are wild! You can get up to 81% KV compression at 60K tokens for Gemma-4-31B-IT in BF16 🔥 Unlike TurboQuant, which quantizes KV cache values, TriAttention prunes low-importance tokens entirely by scoring keys using
