Daniel Estévez

@destevez
872 Followers
96 Following
888 Posts
Everything space & RF. Amateur radio operator (EA4GPZ / M0HXM). PhD in Mathematics from. he/him
Bloghttps://destevez.net
Twitterhttps://twitter.com/ea4gpz
Keyoxideaspe:keyoxide.org:MPXJS6MJRS3LENLWBNOOAMHOMY
This is somewhat hidden to the programmer because the linker script tries to place the buffers in different banks when possible. However, it's relevant when there are more buffers than banks or when trying to access multiple elements from the same buffer, which is what I ran into
More fun with the Ryzen AI NPU. I'm working on a more complex algorithm, and I'm seeing memory stalls that make me lose one cycle in each iteration. It turns out that the data memory of compute tiles is organized as 8x 256-bit wide 8 KiB memory banks. Each bank supports one simultaneous access, and every two banks are interleaved to form a 512-bit wide 16 KiB virtual bank. The two load units of the processor can perform simultaneous 512-bit loads only if these target different 16 KiB banks.

This post is an ideal self-contained introduction if you want to learn how NPUs work from a low-level perspective.

Read more: https://destevez.net/2026/05/getting-peak-tops-on-a-ryzen-ai-7-350-npu/

Getting peak TOPS on a Ryzen AI 7 350 NPU – Daniel Estévez

I explain how the IRON Python API is used to generate LLVM MLIR that defines how the NPU is set up, including the configuration of all the DMAs used for data movement. I go through the relevant sections of the MLIR code and explain how it is compiled to lower level objects. Finally, I show how to use tracing to measure the performance of the NPU workload execution, and check that it matches the understanding we had obtained by analyzing the assembly code.
Then I explain how SIMD operations are fundamentally intended for matrix multiplication operations. For instance, an 8x8 times 8x8 matrix multiplication of int8 values can be done in a single SIMD instruction which performs 1024 integer operations in a single clock cycle. I implement a C++ kernel and show how it maps to assembly, and how to read the assembly to detect performance losses.
I start by giving an overview of the NPU hardware, explaining how it is organized as an array of compute, memory, and shimNOC tiles connected together mainly by an AXI-S interconnect for wide bandwidth data movement. I also explain the exposed-pipeline VLIW SIMD architecture.
New blog post: Getting peak TOPS on a Ryzen AI 7 350 NPU. This is an introduction to low-level programming on AMD NPUs using mlir-aie. I build an example that demonstrates 56 TOPS, very close to the max theoretical performance. These NPUs are identical to Xilinx AIE-MLv2 engines.

I relate all these decodes to the uplink decodes from my previous post, explaining the UE and network conversations.

Read more: https://destevez.net/2026/04/decoding-the-nb-iot-downlink/

I explain how the Type2 common and UE-specific search spaces for the NPDCCH work, how to do blind decoding, and how to decode the NPDSCH transmissions. The SNR in these transmissions is not great, but I'm impressed of how the repetition and FEC often give good decodes.
Next I show how to spot paging messages by their power, and how to decode their corresponding NPDCCH and NPDSCH transmissions. I explain how paging scheduling works in terms of the UE IMSI. Finally I decode the downlink messages sent to the V16 beacon that I was testing. The eNB to which the beacon had attached is the one that is not synchronized with the rest, and unfortunately it is weaker than other eNBs, so some messages cannot be decoded due to interference.