Alex Cheema (@alexocheema)

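oMLX now supports tiered KV caching on Mac. It mitigates the long prefill times on Apple Silicon and avoids redundant prefills by persisting KV caches to disk, even across sessions, which makes it an important improvement for on-device AI performance.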

https://x.com/alexocheema/status/2044188027468025934

#omlx #kvcaching #applesilicon #ondeviceai #performance

oMLX brought tiered kv caching to Mac. Especially important with Apple Silicon where prefill time is very long - you avoid redundant prefills, even between sessions by persisting kv caches to disk.
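The idea of persisting KV caches to disk so a later session can skip the prefill can be sketched roughly as below. This is a minimal illustration, not oMLX's implementation or API: the class name `TieredKVCache`, the pickle file layout, and the `memory_slots` parameter are all hypothetical, and the "KV state" is just any picklable Python object standing in for real tensors.

```python
import hashlib
import pickle
from collections import OrderedDict
from pathlib import Path


class TieredKVCache:
    """Hypothetical two-tier cache: a small in-memory LRU in front of disk files."""

    def __init__(self, directory, memory_slots=2):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.memory_slots = memory_slots
        self.memory = OrderedDict()  # key -> kv_state, least recently used first

    @staticmethod
    def _key(token_ids):
        # Identify a prompt by a hash of its exact token sequence.
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def _remember(self, key, kv_state):
        # Promote to the memory tier and evict the least recently used entries.
        self.memory[key] = kv_state
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)

    def put(self, token_ids, kv_state):
        # Write through to disk so the entry survives the end of the session.
        key = self._key(token_ids)
        with open(self.directory / f"{key}.pkl", "wb") as f:
            pickle.dump(kv_state, f)
        self._remember(key, kv_state)

    def get(self, token_ids):
        # Returns (kv_state, tier) where tier is "memory", "disk", or "miss".
        key = self._key(token_ids)
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        path = self.directory / f"{key}.pkl"
        if path.exists():
            with open(path, "rb") as f:
                kv_state = pickle.load(f)
            self._remember(key, kv_state)
            return kv_state, "disk"
        return None, "miss"
```

A second `TieredKVCache` pointed at the same directory plays the role of a new session: its first lookup is served from disk, and repeat lookups from memory. Real systems would store tensors in a dedicated format rather than pickle, which should only be loaded from trusted files.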

Avi Chawla (@_avichawla)

A technical tweet that compares LLM inference speed with and without KV caching and explains why KV caching matters for performance. Useful for developers interested in LLM serving optimization and inference efficiency.

https://x.com/_avichawla/status/2035084029062750714

#llm #inference #kvcaching #optimization #serving

LLM inference speed with vs. without KV caching: (learn how and why it works below)
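To make the "with vs. without" comparison concrete, here is a toy single-head attention decoder in plain Python. It is a sketch under simplifying assumptions (one head, no real model, hypothetical function names), and it counts key/value projections instead of measuring wall-clock time. Without a cache, every step re-projects the whole prefix; with a cache, each token is projected once.

```python
import math


def project(x, w):
    # Multiply vector x by matrix w, given as a list of rows.
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all keys and values.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def decode_without_cache(tokens, wq, wk, wv):
    # Recompute K and V for the entire prefix at every step.
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        kv_projections += 2 * len(prefix)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, kv_projections


def decode_with_cache(tokens, wq, wk, wv):
    # Project each token once and append its K and V to the cache.
    keys, values, outputs, kv_projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, wk))
        values.append(project(x, wv))
        kv_projections += 2
        outputs.append(attend(project(x, wq), keys, values))
    return outputs, kv_projections
```

Both functions return the same attention outputs. For `n` tokens the uncached version performs `n * (n + 1)` key/value projections and the cached version `2 * n`, which is where the speedup comes from.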

New research shows KV‑cache compaction can slash LLM memory usage by up to 50× while preserving quality. With chunked processing and attention‑matching tricks, models like Llama 3.1 and Qwen‑3 handle far longer contexts—great news for open‑source and enterprise workloads. Dive into the benchmarks! #KVCaching #LLMMemory #LongContexts #ModelCompression

🔗 https://aidailypost.com/news/kv-cache-compaction-cuts-llm-memory-50-chunked-processing-long

KV caching is a necessity for modern #LLMs, but it's not easy to do right. There's a literal zoo of techniques designed to handle it at many different levels. Which should you use, and what are the benefits of each?

In this post I go through a recent survey article that collects and categorizes the most important KV caching techniques released in recent months. Brace yourself for a deep dive!

https://www.zansara.dev/posts/2025-10-26-kv-caching-optimizations-intro/

#AI #GenAI #LLM #KVcaching #vllm

Making sense of KV Cache optimizations, Ep. 1: An overview

Let's make sense of the zoo of techniques that exist out there.

Sara Zan

Do you know how exactly prompt caching works in #GPT models? What is cached, at which stage? Let's have a deep dive into KV caching and how it makes your #LLM inference speed constant regardless of the prompt size.

https://www.zansara.dev/posts/2025-10-23-kv-caching/

#AI #GenAI #kvcaching

How does prompt caching work?

Nearly all inference libraries can do it for you. But what's really going on under the hood?

Sara Zan
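One common way to think about prompt caching is prefix reuse: if the start of a new prompt matches a prompt whose KV state is already cached, only the remaining tokens need a prefill. The sketch below illustrates that idea only. It is not taken from the linked post or from any particular inference library, and the function names and the dictionary-of-tuples cache are hypothetical.

```python
def longest_cached_prefix(cached_prefixes, token_ids):
    # Return the length of the longest prefix of token_ids present in the cache.
    for end in range(len(token_ids), 0, -1):
        if tuple(token_ids[:end]) in cached_prefixes:
            return end
    return 0


def plan_prefill(cached_prefixes, token_ids):
    # Split a prompt into the part served from cache and the part to compute.
    reused = longest_cached_prefix(cached_prefixes, token_ids)
    return {
        "reused_tokens": reused,
        "tokens_to_prefill": list(token_ids[reused:]),
    }


# Hypothetical cache: token-id prefixes mapped to opaque KV state handles.
cache = {(1, 2): "kv-state-a", (1, 2, 3): "kv-state-b"}
plan = plan_prefill(cache, [1, 2, 3, 4, 5])
```

Here `plan` reports three reused tokens and leaves `[4, 5]` to prefill. Production systems typically hash fixed-size blocks of tokens instead of scanning every prefix length, but the matching principle is the same.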