It is not well documented, but it turns out that if you run crypto benchmark code on an STM32F407 ("Discovery") board (with an ARM Cortex-M4 CPU) with instructions in SRAM (instead of Flash), you get extra delays unless you set SYSCFG_MEMRMP accordingly. After a weekend of dabbling, I can now run my benchmarks (for Falcon/FN-DSA) at 168 MHz with no wait states or cache issues. Details here: https://github.com/pornin/c-fn-dsa/tree/main/bench_cm4
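
For illustration only, here is a minimal sketch of what setting SYSCFG_MEMRMP can look like. It is not taken from the linked repository. It assumes the register addresses and bit positions from my reading of the STM32F4 reference manual (RM0090): RCC_APB2ENR at 0x40023844 with SYSCFGEN as bit 14, SYSCFG_MEMRMP at 0x40013800, and MEM_MODE = 0b11 selecting embedded SRAM at address 0x00000000. The function name `remap_sram_to_zero` is hypothetical, and the registers are passed as pointers so the logic can be checked on a host machine.

```c
#include <stdint.h>

/* Assumed from RM0090 for the STM32F405/407; check against your part. */
#define RCC_APB2ENR_ADDR             0x40023844u
#define SYSCFG_MEMRMP_ADDR           0x40013800u
#define RCC_APB2ENR_SYSCFGEN         (1u << 14)  /* SYSCFG clock enable */
#define SYSCFG_MEMRMP_MEM_MODE_MASK  0x3u        /* MEM_MODE[1:0] */
#define SYSCFG_MEMRMP_MEM_MODE_SRAM  0x3u        /* embedded SRAM at 0x00000000 */

/* Hypothetical helper: enable the SYSCFG clock, then select SRAM as the
   memory aliased at address 0x00000000. Other bits are left unchanged. */
void
remap_sram_to_zero(volatile uint32_t *rcc_apb2enr,
                   volatile uint32_t *syscfg_memrmp)
{
    uint32_t v;

    /* SYSCFG registers are only writable once their clock is enabled. */
    *rcc_apb2enr |= RCC_APB2ENR_SYSCFGEN;

    /* Read-modify-write of the MEM_MODE field only. */
    v = *syscfg_memrmp;
    v &= ~SYSCFG_MEMRMP_MEM_MODE_MASK;
    v |= SYSCFG_MEMRMP_MEM_MODE_SRAM;
    *syscfg_memrmp = v;
}
```

On the board, this would be called as `remap_sram_to_zero((volatile uint32_t *)RCC_APB2ENR_ADDR, (volatile uint32_t *)SYSCFG_MEMRMP_ADDR)`. As far as I understand it, the benchmark code then has to execute through the alias at 0x00000000 rather than at 0x20000000, so that instruction fetches use the I-bus instead of the system bus.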
(As a side note, I have done some more assembly optimization work, so signing cost is now 19.7 million cycles at n=512, down from 22.0 million previously.)