
Accelerating FAEST Signatures on ARM: NEON SIMD AES and Parallel VOLE Optimization
FAEST is a post-quantum digital signature candidate whose performance is dominated by repeated AES-CTR-based PRG calls in the VOLE-in-the-Head phase, yet its reference implementation provides no FAEST-specialized ARM NEON acceleration path. We present an ARM-oriented optimization that accelerates this bottleneck using general-purpose NEON SIMD instructions without relying on ARMv8 Crypto Extensions. The proposed implementation combines a register-resident 256-byte S-box with TBL/TBX-based four-stage SubBytes, 4-way and 8-way parallel AES block processing, a fixed-size PRG path specialized for the FAEST tree structure, and pthread-based batch-level parallelization of independent VOLE tasks. Evaluated on all 12 parameter sets of FAEST v2 on Raspberry Pi 4 and Apple M2, the combined optimization achieves speedups of up to $136.9\times$ and $330.1\times$, respectively, over the pure-C reference. On RPi4, the single-thread NEON implementation outperforms OpenSSL's software AES, and on M2, the full NEON-plus-pthread configuration outperforms the best available reference configuration, including hardware-accelerated OpenSSL, across all tested parameters.








