Optimizing IRQ latency on the STM32H743 @ 480 MHz, perhaps for NES ROM emulation... Best result so far: 100 nanoseconds input-to-output latency when the vector table and the IRQ handler are relocated to Tightly-Coupled Memory without making HAL calls. Not bad, but the GPIO controller (several buses away) looks like the real performance killer here. WARNING: buggy code, see correction https://mk.absturztau.be/notes/ajvb448y305b01i4. #electronics #STM32
Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. Just enabled i-cache and d-cache, and the IRQ latency dropped from 100 ns to 70 ns. 🚀 But the caches shouldn't matter if the handler and vector table are already in Tightly-Coupled Memory, so my code must still be touching slow memory somewhere. The stack perhaps, which is still in "normal" RAM. The slow Flash perhaps also makes it slower to abort main() if an instruction is stuck in a wait state. Need to check everything carefully... #electronics #STM32
Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. The 70 ns vs. 100 ns overhead mystery is solved. I had not actually relocated the vector table to Tightly-Coupled Memory; it was still in Flash. The STM32 HAL macro USER_VECT_TAB_ADDRESS is a flag, not a memory address! In fact, only a few hardcoded addresses are available, and a real user override is not provided (the name "user" is a lie). Solution: just change VTOR manually, don't trust the startup code. I'm now getting a 70 ns IRQ latency without CPU cache. #electronics #STM32
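A minimal sketch of that manual VTOR fix, written so the copy step also compiles on a PC. The names `copy_vector_table`, `VECTOR_ENTRIES`, `VTOR_ALIGN` and `tcm_vectors` are made up for illustration, and the entry count (16 system exceptions plus 150 IRQs) and the 1024-byte alignment are assumptions for the STM32H743 that should be checked against the reference manual. The register write itself only appears in the trailing comment, using the CMSIS names `SCB->VTOR`, `__DSB()` and `__ISB()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed for the STM32H743: 16 system exceptions + 150 device IRQs. */
#define VECTOR_ENTRIES 166u
/* VTOR wants the table aligned to its size rounded up to a power of two:
 * 166 * 4 = 664 bytes, so 1024. */
#define VTOR_ALIGN 1024u

/* Copy the vector table from `src` (e.g. Flash) to `dst` (e.g. a buffer
 * that the linker script places in ITCM or DTCM).
 * Returns 0 on success, -1 if `dst` is not aligned well enough for VTOR. */
int copy_vector_table(uint32_t *dst, const uint32_t *src, size_t entries)
{
    if (((uintptr_t)dst % VTOR_ALIGN) != 0) {
        return -1;
    }
    memcpy(dst, src, entries * sizeof *dst);
    return 0;
}

/* On the target, roughly (not compiled here; tcm_vectors is hypothetical):
 *
 *   if (copy_vector_table(tcm_vectors, (const uint32_t *)SCB->VTOR,
 *                         VECTOR_ENTRIES) == 0) {
 *       __disable_irq();
 *       SCB->VTOR = (uint32_t)(uintptr_t)tcm_vectors;
 *       __DSB();
 *       __ISB();
 *       __enable_irq();
 *   }
 */
```

Reading `SCB->VTOR` back afterwards is a cheap way to confirm the table really moved, which is exactly the check that was missing here.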
I do not understand how the NES system bus works, even after reading multiple tutorials. Only one way to find out... #electronics #NES #NESdev
Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. I decided to try an event loop using the WFE instruction instead of IRQs, and I managed to get 60 ns input-to-output latency. I suspect this is the best possible latency. Abusing the QSPI controller to generate a write request did not improve latency (in fact it slightly degraded it), even though the QSPI controller is physically close to the CPU. Clearly, passively monitoring signals is not the way to go for bus emulation. Perhaps the solution is predicting the clock before it even arrives, by internally generating a phase-shifted version of it. #electronics #STM32
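For a sense of scale, the latencies measured so far can be converted into core clock cycles. A small sketch only: `ns_to_cycles` and `CPU_HZ` are illustrative names, and it assumes the core really runs at the stated 480 MHz.

```c
#include <stdint.h>

#define CPU_HZ 480000000ULL  /* core clock stated in the posts */

/* Convert a latency in nanoseconds to CPU cycles, rounded to nearest. */
static inline uint32_t ns_to_cycles(uint64_t ns, uint64_t cpu_hz)
{
    return (uint32_t)((ns * cpu_hz + 500000000ULL) / 1000000000ULL);
}
```

At 480 MHz, 100 ns is 48 cycles, 70 ns is about 34 cycles, and 60 ns is about 29 cycles.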
Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. My "zero-latency IRQ" idea is a success, now I'm getting a 17.30 ns "effective" latency! On every rising edge of the clock, the hardware immediately starts a timer that fires after a programmed delay, calculated to be slightly before the next clock rising edge. This way, the firmware is triggered from a recovered, phase-shifted version of the clock, a little bit like how analog NTSC TVs got their H/VSYNC. Interrupt latency is completely eliminated for all but the first clock cycle (which is also predictable with pre-enabled outputs, since it's always the reset vector). Perfect bus emulation starts looking feasible. #electronics #STM32
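The "programmed delay" is just one clock period minus however early the firmware needs to wake up. A rough sketch of that arithmetic follows; the timer peripheral setup is not shown, and every number in the usage note is an assumption rather than something measured here: a 240 MHz timer clock, the NTSC NES CPU clock of about 1.789773 MHz, and a 70 ns lead borrowed from the IRQ latency above. The function names are illustrative.

```c
#include <stdint.h>

/* Timer ticks in one bus clock period, rounded to nearest. */
uint32_t period_ticks(uint32_t timer_hz, uint32_t bus_hz)
{
    return (uint32_t)(((uint64_t)timer_hz + bus_hz / 2u) / bus_hz);
}

/* Nanoseconds to timer ticks, rounded up so the lead is never too short. */
uint32_t ns_to_ticks_ceil(uint32_t ns, uint32_t timer_hz)
{
    return (uint32_t)(((uint64_t)ns * timer_hz + 999999999ULL) / 1000000000ULL);
}

/* Delay to load into the edge-triggered one-shot timer: one bus period
 * minus the lead time (handler latency plus margin).
 * Returns 0 if the lead time does not fit inside one period. */
uint32_t predicted_delay_ticks(uint32_t timer_hz, uint32_t bus_hz,
                               uint32_t lead_ns)
{
    uint32_t period = period_ticks(timer_hz, bus_hz);
    uint32_t lead = ns_to_ticks_ceil(lead_ns, timer_hz);
    return (lead < period) ? (period - lead) : 0u;
}
```

With those assumed numbers, one bus period is about 134 timer ticks, a 70 ns lead is 17 ticks, and the timer would be loaded with 117 ticks. In practice the lead would be tuned on a scope rather than computed.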
Making a 60-pin Famicom debug cartridge for testing my cartridge emulator... #electronics #NES #NESdev

"Warn : no flash bank found for address 0x08100000"Spent half an hour trying to figure out why can't OpenOCD see my upper flash bank, while claiming my STM32 is dual-banked at the same time. Solution: use stm32h7x_dual_bank.cfg, not stm32h7x.cfg. ​#electronics #STM32

@niconiconi ah time travel! the best way to deal with latency!
@niconiconi Oh, that's a great idea, so if I understand correctly, (and stripping all the details), on each clock you fire a timer that will cause an interrupt that will reach your code the exact moment the bus is accessed next time, right? Is there any noticeable jitter?
@doragasu running in SRAM with an empty IRQ handler, there's a ~10 ns jitter.
@niconiconi @doragasu ahh I came here to wonder about jitter and there she is, beautiful
@niconiconi Those are awesome results 👍, thanks for sharing! I'd like to try something similar using a CH32, but I don't know where I'm going to get the time to do it 😅.
@niconiconi The 6502 does a few dummy bus cycles after reset before fetching the reset vector anyway, so you should be able to catch even the read of the reset vector, according to this: https://www.pagetable.com/?p=410
@niconiconi doing a deep dive in NES?
@puniko Trying to emulate the NES cartridge mapper ASIC + ROM + RAM hardware in software using a fast microcontroller.
@niconiconi boioioioing
@gsuberland @niconiconi well if that ain't a… ringing endorsement for proper bus termination and well-decoupled drivers!
@funkylab @gsuberland Can you do any meaningful signal integrity assessment without a localized high-quality test point? Here, the oscilloscope probe has extremely poor ground connection, at the opposite side of the PCB to the power supply, via the oscilloscope probe from another channel. So I won't make any statement about the PCB's signal integrity.

@gsuberland @niconiconi thank you

Now every time I see a signal, this is going to play in my head like a soundboard tied to a brain interrupt 

@niconiconi Nice, will you do a writeup when you end your investigation? I love low level details of these old 8 and 16 bits systems!
@niconiconi I was about to ask if cache even matters for the stack and then realised it's probably the most important thing to cache unless the arch has some sort of SRAM block purely for the stack.
@gsuberland @niconiconi SRAM for latency-critical memory: which TCM is!
Q: Does an M7 have the dual stack thing with a main SP and a process SP? In that case, it'd sound pretty doable to just have the ISR stack in TCM.
@niconiconi H743 has an interconnect designed by crackh^W^W quite questionably; I can't imagine using it for anything that latency sensitive over a much simpler MCU
@whitequark My original plan was emulating all the memory mapper logic using only naive C code on the CPU for accessibility, so I just grabbed a random MCU devboard with high f_max and Flash space. When I read about the interconnect bottleneck in the H7, the board was already in transit. Now, I find myself with a board and I'm trying random things with it.
@niconiconi :]
I have a silly question. If you poll the emulation, can you get faster performance? Sometimes it's faster to poll the low latency part and IRQ the rest.
@RueNahcMohr Yes, busy-polling is faster, limited only by the GPIO controller's clock and bus transaction latency.
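For reference, a busy-poll in its simplest form looks something like the sketch below. It takes a pointer to the input register so it can be tried on a PC with a fake variable; `wait_pin_high` is an illustrative name, and the port and pin in the comment are hypothetical. Each pass through the loop is one read of the GPIO input data register, which is where the bus transaction latency comes in.

```c
#include <stdint.h>

/* Spin until any bit in `mask` reads high in *idr, or give up after
 * `max_spins` reads. Returns the number of reads used, or 0 on timeout. */
uint32_t wait_pin_high(const volatile uint32_t *idr, uint32_t mask,
                       uint32_t max_spins)
{
    for (uint32_t n = 1; n <= max_spins; n++) {
        if ((*idr & mask) != 0u) {
            return n;
        }
    }
    return 0u;
}

/* On the target, roughly (hypothetical port and pin, CMSIS register name):
 *
 *   wait_pin_high(&GPIOA->IDR, 1u << 0, UINT32_MAX);
 */
```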