Flash-KMeans: Fast and Memory-Efficient Exact K-Means

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm through the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to a 17.9$\times$ end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.
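The two kernel ideas in the abstract can be illustrated at a high level in NumPy. This is only a sketch of the concepts, not the paper's CUDA kernels: `flash_assign` scans centroid tiles with a running argmin so the full $N \times K$ distance matrix never exists at once, and `sort_inverse_update` sorts points by label so centroid sums become contiguous segment reductions instead of scattered atomic adds. All function and parameter names here are illustrative inventions.

```python
import numpy as np

def flash_assign(x, centroids, k_tile=128):
    """Tiled assignment with an online argmin (sketch of the fused idea).

    Only a (n, k_tile) distance block is live at any time; the running
    best distance/index is updated as each centroid tile is scanned.
    """
    n = x.shape[0]
    best_d = np.full(n, np.inf)
    best_i = np.zeros(n, dtype=np.int64)
    x_sq = (x * x).sum(axis=1)  # ||x||^2, reused for every tile
    for start in range(0, centroids.shape[0], k_tile):
        c = centroids[start:start + k_tile]
        # Squared distances for this tile only: ||x||^2 - 2 x.c + ||c||^2
        d = x_sq[:, None] - 2.0 * (x @ c.T) + (c * c).sum(axis=1)[None, :]
        tile_i = d.argmin(axis=1)
        tile_d = d[np.arange(n), tile_i]
        better = tile_d < best_d
        best_d[better] = tile_d[better]
        best_i[better] = start + tile_i[better]
    return best_i

def sort_inverse_update(x, labels, k):
    """Sort-based centroid update (sketch): group points by cluster,
    then reduce each contiguous segment -- no scatter-style atomics.

    Assumes every cluster owns at least one point.
    """
    order = np.argsort(labels, kind="stable")
    xs, ls = x[order], labels[order]
    starts = np.searchsorted(ls, np.arange(k))
    sums = np.add.reduceat(xs, starts, axis=0)
    counts = np.diff(np.append(starts, len(ls)))
    return sums / counts[:, None]
```

The real kernels do this per GPU thread block with shared-memory tiles; the NumPy version only shows why neither step needs the materialized distance matrix or atomic writes.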

arXiv.org

Generator Expressions: Memory-Efficient MAGIC!

Python's generator expressions vs PHP's generators - which saves more memory? INSANE results!

#php #python #phpvspython #generators #generatorexpressions #yield #lazyevaluation #memoryefficiency #viralcoding #pythonmagic #mindblown #performance

https://www.youtube.com/watch?v=SSbCvMSXeik

Generator Expressions: Memory-Efficient MAGIC! #phpvspython

YouTube
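On the Python side of the video's comparison, the memory gap is easy to demonstrate: a list comprehension materializes every element up front, while a generator expression is a constant-size object that computes items on demand. A minimal sketch:

```python
import sys

# List comprehension: allocates storage for all 100,000 results at once.
squares_list = [i * i for i in range(100_000)]

# Generator expression: a fixed-size lazy object; nothing computed yet.
squares_gen = (i * i for i in range(100_000))

list_bytes = sys.getsizeof(squares_list)  # grows with the element count
gen_bytes = sys.getsizeof(squares_gen)    # small and constant
```

Note that `sys.getsizeof` on the list counts only the pointer array, not the integer objects themselves, so the true gap is even larger than these numbers suggest.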
Stream Huge CSVs Without Memory Explosions #CSV

YouTube
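The technique behind the video's title is the same lazy-iteration idea applied to files: Python's `csv.reader` pulls rows from the file object one at a time, so a generator pipeline over it keeps memory flat no matter how large the CSV is. A minimal sketch (function names are illustrative):

```python
import csv
import io

def stream_rows(fileobj):
    """Yield parsed CSV rows one at a time instead of loading the file."""
    for row in csv.reader(fileobj):
        yield row

def sum_column(fileobj, col):
    """Reduce over the stream; only one row is in memory at a time."""
    return sum(float(row[col]) for row in stream_rows(fileobj))
```

The same pattern works on a real file opened with `open(path, newline="")`; `io.StringIO` stands in here to keep the sketch self-contained.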

Generator Functions vs Async Generators: Memory Efficiency Battle

JavaScript generators vs Python async generators. Which language's generator pattern is more powerful for memory-efficient data processing? Mind = blown!

#javascript #python #javascriptvspython #generators #asyncgenerators #memoryefficiency #lazyevaluation #programmingcomparison #codecomparison #javascripttricks #pythontricks #yield #viralcoding #codingshorts #iterators

https://www.youtube.com/watch?v=dr3H2WUnw7Q

Generator Functions vs Async Generators: Memory Efficiency Battle #iterators

YouTube
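For the Python half of that battle, an async generator combines `yield` with `await`: values are still produced lazily one at a time, but the producer can give control back to the event loop between items, so I/O-bound producers and consumers interleave without buffering everything. A small sketch:

```python
import asyncio

async def squares(n):
    """Async generator: lazy like a plain generator, but each step can
    await (here a zero-delay sleep stands in for real async I/O)."""
    for i in range(n):
        await asyncio.sleep(0)  # yield control to the event loop
        yield i * i

async def consume(n):
    total = 0
    async for value in squares(n):  # async-for drives the generator
        total += value
    return total

result = asyncio.run(consume(5))  # 0 + 1 + 4 + 9 + 16
```

JavaScript's `async function*` / `for await...of` pair is the direct analogue of `async def` with `yield` and `async for` here.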

AI accelerates coding but risks bloated, inefficient code. Lean tools ensure verification, optimization, and resource constraints to build efficient, sustainable apps. #AI #MemoryEfficiency #SoftwareEngineering #ToolChains #LeanDevelopment

https://saysomething.hashnode.dev/lean-ai-leaner-apps-building-efficient-software-with-ai-assisted-development

Building a Fast, Memory-Efficient Hash Table in Java (by borrowing the best ideas)

One day, I ran into SwissTable—the kind of design that makes you squint, grin, and immediately regret every naive linear-probing table you’ve ever shipped. This post is the story of how I tried to bring that same “why is this so fast?” feeling into Java. It’s part deep dive, part engineering diary, and part cautionary tale about performance work.

1) The SwissTable project, explained the way it feels when you first understand it

SwissTable is an open-addressing hash table design that came out of Google’s work and was famously presented as a new C++ hash table approach (and later shipped in Abseil).
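The core SwissTable trick the post refers to is keeping a separate control array of one-byte hash tags next to the slot array, so a probe scans cheap metadata before ever comparing full keys. The toy class below sketches that idea (in Python rather than the post's Java, and without the SIMD group-of-16 probing that makes real SwissTable fast); the class and method names are invented for illustration:

```python
EMPTY, DELETED = -1, -2  # sentinel control values

class ToySwissTable:
    """Toy open-addressing map with a SwissTable-style control array.

    Each occupied slot's control entry holds the low 7 bits of the key's
    hash ("H2"), so probing compares a small tag before touching the
    full (key, value) pair. Real SwissTable scans 16 control bytes per
    step with SIMD; this sketch probes one slot at a time.
    """

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.ctrl = [EMPTY] * capacity    # cheap metadata, scanned first
        self.slots = [None] * capacity    # (key, value) pairs

    def _h1_h2(self, key):
        h = hash(key) & 0x7FFFFFFF
        return h % self.capacity, h & 0x7F  # start slot, 7-bit tag

    def put(self, key, value):
        i, tag = self._h1_h2(key)
        for _ in range(self.capacity):
            c = self.ctrl[i]
            if c in (EMPTY, DELETED) or (c == tag and self.slots[i][0] == key):
                self.ctrl[i] = tag
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.capacity  # tag mismatch: keep probing
        raise RuntimeError("table full")

    def get(self, key):
        i, tag = self._h1_h2(key)
        for _ in range(self.capacity):
            c = self.ctrl[i]
            if c == EMPTY:  # hit a never-used slot: key is absent
                raise KeyError(key)
            if c == tag and self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.capacity
        raise KeyError(key)
```

Even single-slot probing shows the benefit: most mismatches are rejected by a one-byte tag compare instead of a full key equality check.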

Bluue Whale
Memory Efficiency in iOS: Reducing footprint and beyond

Make responsive and performant apps

Anton’s Substack

🚀 Announcing Python-Blosc2 3.5.1

We, Blosc developers, understand that memory efficiency is critical when working with large datasets. To that end, we continuously profile and optimize our codebase to deliver the best possible performance.

This version introduces significant performance and memory optimizations, enhancing the experience of computing with large, compressed datasets.

Compress Better, Compute Bigger!

#Performance #MemoryEfficiency #DataScience #BigData #OpenSource

ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs

The linear growth of key-value (KV) cache memory and the quadratic computational complexity of attention mechanisms pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly parameter retraining. To this end, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs. When applied to LLaMA2-7B, it demonstrates a 20:1 compression ratio for key-value cache retention (reducing the memory footprint to 5\% of baseline) while sustaining comparable generation quality, coupled with triple throughput gains at extreme 54k-token contexts that eliminate out-of-memory failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.
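To see why a 20:1 ratio matters at 54k tokens, it helps to work through the cache arithmetic. The sketch below assumes a LLaMA2-7B-style shape (32 layers, 32 KV heads, head dimension 128, fp16); it only reproduces the 5%-of-baseline footprint claim from the abstract, not the ZSMerge algorithm itself:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: K and V each store n_layers * n_kv_heads * head_dim
    values per token, at bytes_per_elem bytes each (2 for fp16)."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Assumed LLaMA2-7B-like shape at a 54k-token context, fp16
full = kv_cache_bytes(54_000, 32, 32, 128)   # ~28.3 GB uncompressed
compressed = full // 20                       # 20:1 ratio -> 5% footprint
```

At these shapes the uncompressed cache alone exceeds a single consumer GPU's memory, which is why the abstract reports out-of-memory failures disappearing under compression.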

arXiv.org