Mastodawn

Optimizing IRQ latency on the STM32H743 @ 480 MHz, perhaps for NES ROM emulation... Best result so far: 100 nanoseconds input-to-output latency when the vector table and the IRQ handler are relocated to Tightly-Coupled Memory without making HAL calls. Not bad, but the GPIO controller (several buses away) looks like the real performance killer here. WARNING: buggy code, see correction https://mk.absturztau.be/notes/ajvb448y305b01i4. #electronics #STM32

niconiconi Mar 15

Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. Just enabled i-cache and d-cache, and the IRQ latency dropped from 100 ns to 70 ns. 🚀 But cache shouldn't work like this. So my code is still touching slow memory somewhere. The stack perhaps, which is still in "normal" RAM. The slow Flash perhaps also makes it slower to abort main() if an instruction is stuck in a wait state. Need to check everything carefully... #electronics #STM32

niconiconi Mar 15

Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. The 70 ns vs. 100 ns overhead mystery is solved. I had not actually relocated the vector table to Tightly-Coupled Memory; it was still in Flash. The STM32 HAL macro USER_VECT_TAB_ADDRESS is a flag, not a memory address! In fact, only a few hardcoded addresses are available, and a real user override is not provided (the name "user" is a lie). Solution: just change VTOR manually, don't trust the startup code. I'm now getting 70-ns IRQ latency without the CPU cache. #electronics #STM32
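
[Editor's sketch] A minimal illustration of "change VTOR manually", not the code from the post. The function names and the `dst_addr` parameter are hypothetical; the register is passed in as a pointer so the sketch also compiles on a host. On the target, `vtor` would be SCB->VTOR at 0xE000ED08, and the write would normally be followed by the CMSIS barriers __DSB() and __ISB(). The table size of 166 entries (16 system exceptions plus 150 IRQs) is an assumption about the H743.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Required alignment of a Cortex-M vector table: the table size in bytes
 * rounded up to a power of two, with a minimum of 128 bytes. */
static uint32_t vector_table_alignment(uint32_t n_entries)
{
    uint32_t align = 128u;
    while (align < n_entries * 4u)
        align <<= 1;
    return align;
}

/* Copy the vector table to `dst` and point VTOR at it.
 * vtor     : pointer to the VTOR register (0xE000ED08 on the target).
 * dst      : destination buffer, e.g. placed in ITCM or DTCM by the linker.
 * dst_addr : address of `dst` as seen by the CPU (hypothetical parameter,
 *            kept separate so the sketch is testable on a 64-bit host).
 * Returns 0 on success, -1 if dst_addr is not suitably aligned. */
static int relocate_vector_table(volatile uint32_t *vtor,
                                 uint32_t *dst, uint32_t dst_addr,
                                 const uint32_t *src, uint32_t n_entries)
{
    assert(vtor != NULL && dst != NULL && src != NULL);
    if (dst_addr % vector_table_alignment(n_entries) != 0u)
        return -1;
    memcpy(dst, src, (size_t)n_entries * sizeof(uint32_t));
    *vtor = dst_addr; /* on target: follow with __DSB(); __ISB(); */
    return 0;
}
```

With 166 entries the table occupies 664 bytes, so the copy has to sit on a 1024-byte boundary; a misaligned address is rejected rather than silently truncated by the hardware.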

niconiconi Mar 16

I do not understand how the NES system bus works, even after reading multiple tutorials. Only one way to find out... #electronics #NES #NESdev

niconiconi 5d ago

Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. I decided to try an event loop using the WFE instruction instead of IRQs, and I managed to get 60 ns input-to-output latency. I suspect this is the best possible latency. Latency did not improve by abusing the QSPI controller to generate a write request (in fact it slightly degraded), even though the QSPI controller is physically close to the CPU. Clearly, passively monitoring signals is not the way to go for bus emulation. Perhaps the solution is predicting the clock before it even arrives, by internally generating a phase-shifted version of it. #electronics #STM32

niconiconi 4d ago

Keep optimizing IRQ latency on the STM32H743 @ 480 MHz. My "zero-latency IRQ" idea is a success, now I'm getting a 17.30 ns "effective" latency! Upon receiving every rising edge of the clock, the hardware immediately starts a timer that fires after a programmed delay, calculated to be slightly before the next clock rising edge. This way, the firmware is triggered from a recovered, phase-shifted version of the clock, a little bit like how analog NTSC TVs got their H/VSYNC. Interrupt latency is completely eliminated for all but the first clock cycle (which is also predictable with pre-enabled outputs, since it's always the reset vector). Perfect bus emulation starts looking feasible. #electronics #STM32
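
[Editor's sketch] The "programmed delay" can be worked out with simple arithmetic. The post does not say which timer, timer clock, or margin is used, so everything here is an assumption: the function name is hypothetical, the bus clock is taken as an NTSC NES CPU clock of about 1.789773 MHz (period roughly 558.73 ns), the timer is assumed to tick at 240 MHz, the IRQ latency is the 70 ns measured earlier in the thread, and the 20 ns lead is an arbitrary margin.

```c
#include <assert.h>
#include <stdint.h>

/* Number of timer ticks to wait after a clock edge so that the handler
 * is already running `lead_ps` before the next edge arrives.
 * All times are in picoseconds to keep the arithmetic in integers.
 * Returns 0 if latency plus lead does not fit inside one period. */
static uint32_t predicted_delay_ticks(uint64_t period_ps,
                                      uint64_t irq_latency_ps,
                                      uint64_t lead_ps,
                                      uint64_t timer_hz)
{
    assert(timer_hz > 0u);
    if (irq_latency_ps + lead_ps >= period_ps)
        return 0u;
    uint64_t delay_ps = period_ps - irq_latency_ps - lead_ps;
    /* ticks = delay [s] * timer_hz, rounded down */
    return (uint32_t)((delay_ps * timer_hz) / 1000000000000ull);
}
```

Under these assumed numbers, predicted_delay_ticks(558730, 70000, 20000, 240000000) gives 112 ticks, about 466.7 ns after the edge. Rounding down makes the timer fire slightly early rather than late, which is the safer direction for this scheme.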

niconiconi 3d ago

Making a 60-pin Famicom debug cartridge for testing my cartridge emulator... #electronics #NES #NESdev

niconiconi 1d ago

"Warn : no flash bank found for address 0x08100000" Spent half an hour trying to figure out why OpenOCD can't see my upper flash bank, while it claims my STM32 is dual-banked at the same time. Solution: use stm32h7x_dual_bank.cfg, not stm32h7x.cfg.

blobcatfacepalm

#electronics #STM32

Rue Mohr 4d ago

@niconiconi ah time travel! the best way to deal with latency!

doragasu 4d ago

@niconiconi Oh, that's a great idea, so if I understand correctly, (and stripping all the details), on each clock you fire a timer that will cause an interrupt that will reach your code the exact moment the bus is accessed next time, right? Is there any noticeable jitter?

niconiconi 4d ago

@doragasu running in SRAM with an empty IRQ handler, there's a ~10 ns jitter.

external quantum efficiency 4d ago

@niconiconi @doragasu ahh I came here to wonder about jitter and there she is, beautiful

doragasu 4d ago

@niconiconi Those are awesome results 👍, thanks for sharing! I'd like to try something similar using a CH32, but I don't know where I'm going to get the time to do it 😅.

crzwdjk ✅ 4d ago

@niconiconi The 6502 does a few dummy bus cycles after reset before fetching the reset vector anyway so you should be able to catch even the read of the reset vector, according to this: https://www.pagetable.com/?p=410

verifiedsabakan

disability_flag

@niconiconi doing a deep dive in NES?

niconiconi Mar 16

@puniko Trying to emulate the NES cartridge mapper ASIC + ROM + RAM hardware in software using a fast microcontroller.

verifiedsabakan

disability_flag

@niconiconi sounds fun

Graham Sutherland / Polynomial Mar 16

@niconiconi boioioioing

Marcus Müller 6d ago

@gsuberland @niconiconi well if that ain't a… ringing endorsement for proper bus termination and well-decoupled drivers!

niconiconi 6d ago

@funkylab @gsuberland Can you do any meaningful signal integrity assessment without a localized high-quality test point? Here, the oscilloscope probe has extremely poor ground connection, at the opposite side of the PCB to the power supply, via the oscilloscope probe from another channel. So I won't make any statement about the PCB's signal integrity.

Marcus Müller 6d ago

@niconiconi @gsuberland these are fair points!

verified_dragon

@gsuberland @niconiconi thank you

Now every time I see a signal, this is going to play in my head like a soundboard tied to a brain interrupt

doragasu 6d ago

@niconiconi Nice, will you do a writeup when you finish your investigation? I love the low-level details of these old 8- and 16-bit systems!

Graham Sutherland / Polynomial Mar 15

@niconiconi I was about to ask if cache even matters for the stack and then realised it's probably the most important thing to cache unless the arch has some sort of SRAM block purely for the stack.

Marcus Müller Mar 15

@gsuberland @niconiconi SRAM for latency-critical memory: which TCM is!
Q: Does an M7 have the dual stack thing with a main SP and a process SP? In that case, it'd sound pretty doable to just have the ISR stack in TCM.

✧✦Catherine✦✧ Mar 15

@niconiconi H743 has an interconnect designed by crackh^W^W quite questionably; I can't imagine using it for anything that latency sensitive over a much simpler MCU

niconiconi Mar 15

@whitequark My original plan was emulating all the memory mapper logic using only naive C code on the CPU for accessibility, so I just grabbed a random MCU devboard with high f_max and Flash space. When I read about the interconnect bottleneck in the H7, the board was already in transit. Now, I find myself with a board and I'm trying random things with it.

Rue Mohr Mar 15

@niconiconi :]
I have a silly question. If you poll the emulation, can you get faster performance? Sometimes it's faster to poll the low latency part and IRQ the rest.

niconiconi Mar 15

@RueNahcMohr Yes, busy-polling is faster, limited only by the GPIO controller's clock and bus transaction latency.
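
[Editor's sketch] A minimal illustration of busy-polling a GPIO input and answering on an output, not the code used in the thread. The function name and the bounded poll count are hypothetical. The registers are passed in as pointers so the sketch also runs on a host; on the target they would be a GPIO port's IDR (input data) and BSRR (bit set/reset) registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Spin until any bit in `in_mask` reads high in the input register, then
 * set the pins in `out_mask` with a single write to BSRR (the low 16 bits
 * of BSRR set pins, the high 16 bits reset them).
 * Returns the number of reads it took, or -1 if `max_polls` was exceeded. */
static int32_t poll_then_set(volatile const uint32_t *idr,
                             volatile uint32_t *bsrr,
                             uint32_t in_mask, uint32_t out_mask,
                             int32_t max_polls)
{
    assert(idr != NULL && bsrr != NULL);
    for (int32_t polls = 1; polls <= max_polls; ++polls) {
        if ((*idr & in_mask) != 0u) {
            *bsrr = out_mask & 0xFFFFu; /* set-only write, no read-modify-write */
            return polls;
        }
    }
    return -1;
}
```

Each loop iteration is one read across the bus to the GPIO controller, which is why the latency floor is set by that controller's clock and the bus transaction time rather than by the CPU.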