Mastodawn

PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion

Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches lack practicality due to the heavy expense of added sparsity predictor, which severely drops their hardware efficiency. This paper advances the state-of-the-art (SOTA) by proposing a bit-serial enable stage-fusion (BSF) mechanism, which eliminates the need for a separate predictor. However, it faces key challenges: 1) Inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) Hardware under-utilization due to fine-grained and imbalanced bit-level workloads. 3) Tiling difficulty caused by the row-wise dependency in sparsity pruning criteria. We propose PADE, a predictor-free algorithm-hardware co-design for dynamic sparse attention acceleration. PADE features three key innovations: 1) Bit-wise uncertainty interval-enabled guard filtering (BUI-GF) strategy to accurately identify trivial tokens during each bit round; 2) Bidirectional sparsity-based out-of-order execution (BS-OOE) to improve hardware utilization; 3) Interleaving-based sparsity-tiled attention (ISTA) to reduce both I/O and computational complexity. These techniques, combined with custom accelerator designs, enable practical sparsity acceleration without relying on an added sparsity predictor. Extensive experiments on 22 benchmarks show that PADE achieves 7.43x speed up and 31.1x higher energy efficiency than Nvidia H100 GPU. Compared to SOTA accelerators, PADE achieves 5.1x, 4.3x and 3.4x energy saving than Sanger, DOTA and SOFA.

arXiv.org

Muzaffer Kal Dec 22

Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models

https://arxiv.org/abs/2512.14661

Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models

Vision-Language Models (VLMs) have demonstrated strong performance on tasks such as video captioning and visual question answering. However, their growing scale and video-level inputs lead to significant computational and memory overhead, posing challenges for real-time deployment on hardware accelerators. While prior work attempts to reduce redundancy via token pruning or merging, these methods typically operate at coarse granularity and incur high runtime overhead due to global token-level operations. In this study, we propose Focus, a Streaming Concentration Architecture that efficiently accelerates VLM inference through progressive, fine-grained redundancy elimination. Focus introduces a multilevel concentration paradigm that hierarchically compresses vision-language inputs at three levels: (1) semantic-guided token pruning based on textual prompts, (2) spatial-temporal block-level concentration using localized comparisons, and (3) vector-level redundancy removal via motion-aware matching. All concentration steps are tightly co-designed with the architecture to support streaming-friendly, on-chip execution. Focus leverages GEMM tiling, convolution-style layout, and cross-modal attention to minimize off-chip access while enabling high throughput. Implemented as a modular unit within a systolic-array accelerator, Focus achieves a 2.4x speedup and 3.3x reduction in energy, significantly outperforming state-of-the-art accelerators in both performance and energy efficiency. Full-stack implementation of Focus is open-sourced at https://github.com/dubcyfor3/Focus.

arXiv.org

Muzaffer Kal Dec 22

Torrent:ADistributedDMAforEfficientand FlexiblePoint-to-MultipointDataMovement

https://arxiv.org/abs/2512.17589

Torrent: A Distributed DMA for Efficient and Flexible Point-to-Multipoint Data Movement

The growing disparity between computational power and on-chip communication bandwidth is a critical bottleneck in modern Systems-on-Chip (SoCs), especially for data-parallel workloads like AI. Efficient point-to-multipoint (P2MP) data movement, such as multicast, is essential for high performance. However, native multicast support is lacking in standard interconnect protocols. Existing P2MP solutions, such as multicast-capable Network-on-Chip (NoC), impose additional overhead to the network hardware and require modifications to the interconnect protocol, compromising scalability and compatibility. This paper introduces Torrent, a novel distributed DMA architecture that enables efficient P2MP data transfers without modifying NoC hardware and interconnect protocol. Torrent conducts P2MP data transfers by forming logical chains over the NoC, where the data traverses through targeted destinations resembling a linked list. This Chainwrite mechanism preserves the P2P nature of every data transfer while enabling flexible data transfers to an unlimited number of destinations. To optimize the performance and energy consumption of Chainwrite, two scheduling algorithms are developed to determine the optimal chain order based on NoC topology. Our RTL and FPGA prototype evaluations using both synthetic and real workloads demonstrate significant advantages in performance, flexibility, and scalability over network-layer multicast. Compared to the unicast baseline, Torrent achieves up to a 7.88x speedup. ASIC synthesis on 16nm technology confirms the architecture's minimal footprint in area (1.2%) and power (2.3%). Thanks to the Chainwrite, Torrent delivers scalable P2MP data transfers with a small cycle overhead of 82CC and area overhead of 207um2 per destination.

arXiv.org

Muzaffer Kal Oct 4

Show thread

Terence Tao Oct 2

I encountered no issues with hallucinations or other AI-generated nonsense. I think the reason for this is that I already had a pretty good idea of what the tedious computational tasks that needed to be performed, and could explain them in detail to the AI in a step-by-step fashion, with each step confirmed in a conversation with the AI before moving on to the next step. After switching strategies to the conversational approach, external validation with Python was only used at the very end, when the AI was able to generate numerical outputs that it claimed to obey the required constraints (which they did).

Muzaffer Kal Jul 9, 2024

File it under: “Global warming is a hoax.”

As record-breaking heat blankets the West, no end in sight

https://t.co/HVUwJwtxmA

As record-breaking heat blankets the West, no end in sight

The heat caused a road in Washington to buckle, officials said.

ABC News

Muzaffer Kal Jan 3, 2024

Andrew Dessler Jan 2, 2024

Calls/emails from reporters asking for a comment on last's years temperature are coming earlier than normal this year. Time to dust off my "last year was hot" auto-response.

Muzaffer Kal Jan 3, 2024

If I won the lottery, there would be signs

Show thread

Muzaffer Kal Aug 24, 2023

@samwho also the replacement cost should be considered

Muzaffer Kal Aug 24, 2023

Sam Rose Feb 19, 2023

Something that stuck with me from a previous job is the quote: “don’t underestimate things that have survived many attempts to kill them.”

Think: DNS, bash, C, TCP.

These things have survived this long for a reason. Find out the reason.

Show thread

Muzaffer Kal Aug 11, 2023

@seanb which ones are they?