Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
https://arxiv.org/abs/2604.15039
#arxiv

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates heavy KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization.
We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.
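The selective offloading idea in the abstract can be illustrated with a small sketch. This is not the paper's scheduler: the `Request` fields, the length threshold, the per-token KV size, and the bandwidth and deadline values are all hypothetical, chosen only to show how a length threshold, a prefix-cache check, and a bandwidth budget might combine into one routing decision.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length in tokens
    cached_prefix_tokens: int  # tokens already covered by the local prefix cache


def kv_transfer_seconds(tokens: int, kv_bytes_per_token: int,
                        link_bytes_per_s: float) -> float:
    """Time to ship the KVCache for `tokens` tokens over the inter-cluster link."""
    return tokens * kv_bytes_per_token / link_bytes_per_s


def route(req: Request, *,
          min_offload_tokens: int = 8192,         # hypothetical length threshold
          kv_bytes_per_token: int = 20_000,       # hypothetical KV footprint
          free_link_bytes_per_s: float = 1.0e9,   # hypothetical spare bandwidth
          max_transfer_s: float = 1.0) -> str:    # hypothetical transfer budget
    """Return "remote" to offload prefill, or "local" to keep it in the PD cluster."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Short or mostly cached prompts are cheap to prefill locally.
    if uncached < min_offload_tokens:
        return "local"
    # Only offload when the resulting KVCache fits the current bandwidth budget.
    transfer = kv_transfer_seconds(uncached, kv_bytes_per_token, free_link_bytes_per_s)
    return "remote" if transfer <= max_transfer_s else "local"


if __name__ == "__main__":
    for r in (Request(2_048, 0), Request(32_768, 0), Request(32_768, 30_000)):
        print(r, "->", route(r))
```

With these made-up numbers, only the long, uncached prompt is sent to the remote prefill cluster; a real system would also have to account for queueing and fluctuating bandwidth, which the abstract names as the hard part.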
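arXiv.org
fly51fly (@fly51fly)
Researchers at Cornell University and MIT have presented a new method that uses Sequential Monte Carlo to speed up LLM inference. The work could improve the inference efficiency of large language models, enabling faster and more practical deployment.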
https://x.com/fly51fly/status/2046341249880469533
#llm #inference #sequentialmontecarlo #optimization #arxiv

fly51fly (@fly51fly) on X
[LG] Faster LLM Inference via Sequential Monte Carlo
Y Emara, M B d Costa, C Chang, C Freer… [Cornell University & MIT] (2026)
https://t.co/d4AiUwGReW
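X (formerly Twitter)
fly51fly (@fly51fly)
Researchers at Google Cloud have proposed PolicyBank to improve how LLM agents understand policies. It is a new approach that helps agents interpret and apply policies better, with a focus on strengthening their ability to follow complex rules and constraints.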
https://x.com/fly51fly/status/2046343254292214225
#llm #agents #policy #googlecloud #arxiv

fly51fly (@fly51fly) on X
[CL] PolicyBank: Evolving Policy Understanding for LLM Agents
J Choi, J Yoon, L T. Le, S Jha… [Google Cloud] (2026)
https://t.co/ZGViNzXf8d
X (formerly Twitter)
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
https://arxiv.org/abs/2604.15356
#arxiv

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). We prove that at typical language model perplexity -- approximately 10-20 for fluent English text -- this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant's 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
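The arithmetic behind the quoted bit counts can be checked with a short sketch. It only uses the standard relation that a perplexity of P corresponds to log2(P) bits per token, plus the per-component figures quoted in the abstract; the function names are illustrative, and the sketch does not attempt to reproduce the 914,000x ratio, whose derivation the abstract does not give.

```python
import math


def bits_per_token(perplexity: float) -> float:
    """Average bits per token implied by a given perplexity (log base 2)."""
    return math.log2(perplexity)


def bits_per_vector(bits_per_component: int, components: int) -> int:
    """Bits needed for one vector under a fixed per-component quantization budget."""
    return bits_per_component * components


if __name__ == "__main__":
    # Perplexity 10-20, the range the abstract cites for fluent English text.
    for ppl in (10, 20):
        print(f"perplexity {ppl}: {bits_per_token(ppl):.2f} bits per token")

    # 3 bits per component with 64-128 components per head, as quoted.
    for dim in (64, 128):
        print(f"{dim} components: {bits_per_vector(3, dim)} bits per vector")

    # The abstract's pessimistic case divides the claimed ratio by 1000.
    print(f"914,000x / 1000 = {914_000 / 1000:.0f}x")
```

The first loop gives roughly 3.32 and 4.32 bits, matching the 3.3-4.3 range in the abstract; the second gives 192 and 384 bits for a single quantized vector.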
arXiv.org
fly51fly (@fly51fly)
Self-Distillation Zero proposes replacing training on binary rewards alone with self-revision, which produces dense supervision signals. It is notable research that could ease the reward-sparsity problem in RL and preference learning.
https://x.com/fly51fly/status/2045620305318806007
#selfdistillation #reinforcementlearning #densesupervision #llm #arxiv
fly51fly (@fly51fly)
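A paper on a framework for autonomous long-horizon task execution in ML research. It points toward agents that plan and carry out long-running engineering work on their own, with important implications for autonomous research automation and for AI coding and research assistant tools.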
https://x.com/fly51fly/status/2045622070692974710
#autonomousai #mlresearch #agent #longhorizon #arxiv
fly51fly (@fly51fly)
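Researchers at INRIA Lille and Google DeepMind have posted the paper "Sample-efficient Monte-Carlo planning" on arXiv, describing a sample-efficient Monte Carlo planning technique. It appears to be new work in reinforcement learning and planning on exploring more efficiently with fewer samples.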
https://x.com/fly51fly/status/2045252557430493624
#reinforcementlearning #planning #montecarlo #deeplearning #arxiv

fly51fly (@fly51fly) on X
[CL] Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
J Grill, M Valko, R Munos [INRIA Lille & Google DeepMind] (2026)
https://t.co/mt11Ph7iAv
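X (formerly Twitter)
fly51fly (@fly51fly)
New research on structured reduction of large language models. It proposes combining compressed sensing with inference-aware techniques to shrink LLMs more efficiently, and is a noteworthy academic contribution in model compression and optimization.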
https://x.com/fly51fly/status/2045260733974508012
#llm #compression #optimization #arxiv #efficiency

fly51fly (@fly51fly) on X
[CL] Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
A Kiruluta [UC Berkeley] (2026)
https://t.co/l88OOcPxGs
X (formerly Twitter)
SIR-Bench – benchmark for security incident response agents
https://arxiv.org/abs/2604.12040
#arxiv #security

SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only whether agents reach correct triage decisions, but whether they discover novel evidence through active investigation. To construct SIR-Bench, we develop Once Upon A Threat (OUAT), a framework that replays real incident patterns in controlled cloud environments, producing authentic telemetry with measurable investigation outcomes. Our evaluation methodology introduces three complementary metrics: triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3), assessed through an adversarial LLM-as-Judge that inverts the burden of proof -- requiring concrete forensic evidence to credit investigations. Evaluating our SIR agent on the benchmark demonstrates 97.1% true positive (TP) detection, 73.4% false positive (FP) rejection, and 5.67 novel key findings per case, establishing a baseline against which future investigation agents can be measured.
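The headline numbers (TP detection, FP rejection, novel findings per case) can be read as simple aggregates over per-case results. The sketch below is an assumption about how such rates might be computed, not the benchmark's scoring code: the `CaseResult` fields are hypothetical, the abstract does not say whether findings are averaged over all cases or only true incidents, and the LLM-as-Judge step is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    is_true_incident: bool     # ground-truth label for the test case
    flagged_as_incident: bool  # the agent's triage decision
    novel_findings: int        # findings credited beyond the original alert


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Aggregate per-case results into three summary numbers."""
    true_cases = [r for r in results if r.is_true_incident]
    benign_cases = [r for r in results if not r.is_true_incident]
    # Share of real incidents the agent flagged.
    tp_detection = (sum(r.flagged_as_incident for r in true_cases) / len(true_cases)
                    if true_cases else 0.0)
    # Share of benign alerts the agent correctly dismissed.
    fp_rejection = (sum(not r.flagged_as_incident for r in benign_cases) / len(benign_cases)
                    if benign_cases else 0.0)
    # Assumption: findings are averaged over all cases.
    findings = (sum(r.novel_findings for r in results) / len(results)
                if results else 0.0)
    return {"tp_detection": tp_detection,
            "fp_rejection": fp_rejection,
            "findings_per_case": findings}


if __name__ == "__main__":
    toy = [CaseResult(True, True, 3), CaseResult(True, False, 1),
           CaseResult(False, False, 0), CaseResult(False, True, 0)]
    print(summarize(toy))
```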
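arXiv.org
fly51fly (@fly51fly)
The study "Continuous Knowledge Metabolism", which generates new hypotheses from evolving scientific literature, has been introduced. It is an AI methodology for automatically generating scientific hypotheses in a setting where the literature is continually updated, with potential use in automating research exploration, hypothesis discovery, and knowledge accumulation.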
https://x.com/fly51fly/status/2044530851913077062
#ai #scientificresearch #hypothesisgeneration #arxiv #llm

fly51fly (@fly51fly) on X
[CL] Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature
J Tao, Y Wang, X Liu, M Yang [Central University of Finance and Economics & Beijing Institute of Technology & TsingyuAI] (2026)
https://t.co/gN7LtgVu7v
X (formerly Twitter)