New blog post: Getting peak TOPS on a Ryzen AI 7 350 NPU. This is an introduction to low-level programming on AMD NPUs using mlir-aie. I build an example that demonstrates 56 TOPS, very close to the maximum theoretical performance. These NPUs are identical to Xilinx AIE-MLv2 engines.
I start with an overview of the NPU hardware, explaining how it is organized as an array of compute, memory, and shimNOC tiles, connected mainly by an AXI-S interconnect for high-bandwidth data movement. I also explain the exposed-pipeline VLIW SIMD architecture.
Then I explain how the SIMD operations are fundamentally designed around matrix multiplication. For instance, an 8x8 by 8x8 matrix multiplication of int8 values can be done by a single SIMD instruction, which performs 1024 integer operations (512 multiply-accumulates) in a single clock cycle. I implement a C++ kernel and show how it maps to assembly, and how to read the assembly to detect performance losses.
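As a rough illustration of where the 1024 figure comes from, here is a plain-Python sketch (not the C++ kernel from the post) that multiplies two 8x8 matrices and counts each multiply and each add as one operation. The function names `matmul_8x8x8_int8` and `peak_tops` are hypothetical, and the tile count and clock frequency passed to `peak_tops` are placeholders, since this summary does not state the real values.

```python
def matmul_8x8x8_int8(a, b):
    """Multiply two 8x8 integer matrices and count the scalar operations.

    Each multiply-accumulate is counted as 2 operations (1 multiply + 1 add),
    which is the usual convention behind TOPS figures.
    """
    n = 8
    ops = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0  # wide accumulator, so int8 products do not overflow
            for k in range(n):
                acc += a[i][k] * b[k][j]
                ops += 2
            c[i][j] = acc
    return c, ops


def peak_tops(ops_per_cycle, num_tiles, clock_hz):
    """Theoretical peak in TOPS for a hypothetical tile count and clock."""
    return ops_per_cycle * num_tiles * clock_hz / 1e12


# Test input: values in the int8 range, multiplied by the identity matrix.
a = [[(i * 8 + j) - 128 for j in range(8)] for i in range(8)]
identity = [[1 if i == j else 0 for j in range(8)] for i in range(8)]

c, ops = matmul_8x8x8_int8(a, identity)
print(ops)  # 8 * 8 * 8 multiply-accumulates * 2 = 1024

# Placeholder numbers only: one tile at 1 GHz, not the real NPU parameters.
print(peak_tops(ops, num_tiles=1, clock_hz=1e9))
```

The hardware does all of this in one instruction per cycle; the loop here only makes the operation count explicit.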
I explain how the IRON Python API is used to generate LLVM MLIR that defines how the NPU is set up, including the configuration of all the DMAs used for data movement. I go through the relevant sections of the MLIR code and explain how it is compiled to lower-level objects. Finally, I show how to use tracing to measure the performance of the NPU workload, and check that it matches the understanding we obtained by analyzing the assembly code.

This post is an ideal self-contained introduction if you want to learn how NPUs work from a low-level perspective.

Read more: https://destevez.net/2026/05/getting-peak-tops-on-a-ryzen-ai-7-350-npu/

@destevez “blog post” is stretching the definition of what might as well be a book! Incredible work!