🎓 ArXiv, the pioneering preprint server, declares independence from Cornell
「 The move will help arXiv raise more money from a broader range of donors to fund the staffing and technology needed to support the site’s skyrocketing number of preprints—expected to top 300,000 this year—says Greg Morrisett, dean and vice provost of Cornell Tech, the graduate-education and research arm of the university that manages arXiv 」
ArXiv declares independence from Cornell
fly51fly (@fly51fly)
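Researchers at IBM Research have released PRISM, a study of retention and interaction in mid-training. It presents new findings on AI model training dynamics and performance retention, and is a useful reference for optimizing large language model training.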
#arxiv becomes independent and donation-funded

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.
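The two ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' CUDA implementation: the function names `assign_tiled` and `update_sorted`, the `tile` parameter, and the choice to leave empty clusters at zero are all assumptions made for illustration. The first function keeps a running argmin over tiles of centroids, so the full $N \times K$ distance matrix is never held at once; the second sorts points by cluster id and reduces each contiguous segment, replacing scatter-style accumulation.

```python
import numpy as np

def assign_tiled(x, c, tile=256):
    """Nearest-centroid assignment with a running argmin over centroid tiles.

    Only an N x tile block of distances exists at any time (illustrative
    stand-in for a fused distance + online-argmin kernel).
    """
    n = x.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    x_sq = (x * x).sum(axis=1)
    for start in range(0, c.shape[0], tile):
        ct = c[start:start + tile]
        # Squared Euclidean distances from every point to this tile only.
        d = x_sq[:, None] - 2.0 * (x @ ct.T) + (ct * ct).sum(axis=1)[None, :]
        local = d.argmin(axis=1)
        local_dist = d[np.arange(n), local]
        # Strict '<' keeps the earliest centroid on exact ties.
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = local[better] + start
    return best_idx

def update_sorted(x, assign, k):
    """Centroid update via sort + segmented reduction instead of scatter-adds.

    Empty clusters are left as zero vectors (an assumption of this sketch).
    """
    order = np.argsort(assign, kind="stable")  # groups points by cluster id
    sorted_assign = assign[order]
    sorted_x = x[order]
    # 'starts' marks where each non-empty cluster's segment begins.
    ids, starts, counts = np.unique(
        sorted_assign, return_index=True, return_counts=True
    )
    sums = np.add.reduceat(sorted_x, starts, axis=0)
    centroids = np.zeros((k, x.shape[1]), dtype=x.dtype)
    centroids[ids] = sums / counts[:, None]
    return centroids
```

One Lloyd iteration under these assumptions would be `a = assign_tiled(x, c)` followed by `c = update_sorted(x, a, c.shape[0])`. On a GPU the tile loop would live inside a single kernel and the segment sums would be computed per block, which is where the memory-traffic and contention savings described in the abstract would come from.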
ArXiv Declares Independence from Cornell
#HackerNews #ArXiv #Independence #Cornell #Preprint #Server #Science #News