Moonshine is a really cool speech-to-text model optimised for very low latency on low-resource devices like Pis and phones, with roughly 100x lower latency than Whisper.

Spent some time this weekend working on adding support for word level timestamps to it https://github.com/moonshine-ai/moonshine/pull/153

For context, the first version of Moonshine took the Whisper model and optimised away the parts that made latency slow (chiefly the fixed 30s audio window). V2 takes this further, reducing latency and model size even more!

Interestingly, while porting Moonshine speech-to-text to Swift and MLX, I think I've worked out how to significantly boost its accuracy over and above the original version.

I've managed to tweak the pipeline a bit and get the tiny 34M model's word error rate to beat the medium 250M model.
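
For anyone wanting to sanity-check numbers like this themselves: word error rate is just the word-level edit distance between the model transcript and the reference, divided by the reference length. A minimal stdlib-only sketch (my own illustration, not Moonshine's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over hypothesis words
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            # deletion, insertion, or substitution/match
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev = cur
    return d[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words
```

In practice you'd also normalise casing and punctuation before comparing, since ASR benchmarks usually score on normalised text.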

It's consistent across a mix of synthetic text-to-speech audio and audio from some random YouTube videos.

Going to do some more tests, but if this holds I'll likely write it up and submit another PR to the original repo.