Happy to share that our work analyzing the performance of ARM's Memory Tagging Extension (MTE) on real hardware has been accepted at USENIX Security 2026. We look at the performance of MTE on Pixel 8, Pixel 9, and AmpereOne CPUs, and even have some preliminary analysis of MTE performance on the Mac M5. This covers MTE on phones, laptops, and server chips.
Context: Memory safety bugs represent the majority (50% to 70%) of bugs in systems like Chrome and Windows. ARM's memory tagging is a hardware feature that can be used to probabilistically detect memory safety bugs. Unfortunately, most discussions of MTE's performance overhead so far have been vague. Since MTE is now available in a handful of devices, we decided to take a look!
Highlights
----------------
- Performance overheads of MTE can vary widely according to micro-architecture and benchmark.
- SYNC MTE can indeed be implemented efficiently, as shown by the implementations on the AmpereOne, Mac M5, and Pixel's Little core. But micro-architectural details can lead to large overheads on specific benchmarks: e.g., slowdowns of 2x to 6x on Pixel's performance core, 1.8x on Pixel's big core, 1.43x on AmpereOne cores, and 1.29x on Mac M5 cores. Interestingly, on receiving our report, Ampere noted they had also discovered this internally and have fixed it in future chips.
- ASYNC MTE is faster than SYNC MTE and imposes very low overheads on Pixel's performance and Little cores. However, counter to conventional wisdom, subtle micro-architectural issues can compromise ASYNC MTE's performance on specific workloads and micro-architectures: e.g., a 1.8x slowdown on Pixel's big core.
- MTE's runtime support matters! The Linux kernel's MTE support sometimes resulted in a 25% throughput drop for Memcached on AmpereOne chips, due to an assumption based on an ambiguous part of the ARM MTE specification. We submitted a patch to the Linux kernel mailing list that eliminates almost all of the overhead on this benchmark.
- Prior academic work tested MTE using performance analogs. This was reasonable, since MTE hardware has only recently become available. Unfortunately, we find that the analogs don't reflect real-world performance. Further, conclusions about MTE performance from prior work on real devices are at best incomplete (and at worst wrong) due to testing on a single MTE micro-architecture or, in some cases, benchmarking bugs.
- Finally, for use cases beyond enforcing memory safety, the first generation of MTE implementations has mixed results. Data tracing and copy elision can use MTE for speedups today, while CFI and SFI (in-process sandboxing) don't yet show clear performance benefits with today's MTE implementations.
I suspect USENIX will have the papers up soon, but in case you want to look through the specific details, a copy of the paper is available here:
https://shravanrn.com/pubs/mte-extended
Kudos to the whole team and especially to Taehyun for driving this work!
Team: Taehyun Noh (@taehyun), Yingchen Wang, Tal Garfinkel (https://www.linkedin.com/in/tal-garfinkel-937528/), Mahesh Madhav(https://www.linkedin.com/in/mahesh-madhav/), Daniel Moghimi (@flowyroll), Mattan Erez(https://lph.ece.utexas.edu/merez/), Shravan Narayan (@shravanrn)
Also, a shout-out to Mahesh Madhav and Carl Worth from Ampere for their help with Ampere infrastructure and with identifying and testing kernel patches for MTE performance.