Kyle Howells

@iKyle

Interestingly, while porting the Moonshine speech-to-text model to Swift and MLX, I think I've worked out how to significantly boost its accuracy over and above the original version.

I've managed to tweak the pipeline a bit and get the tiny 34M model's word error rate to beat the medium 250M model's.

The improvement is consistent across a mix of synthetic text-to-speech-generated audio and audio from some random YouTube videos.
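For anyone unfamiliar, word error rate is just word-level edit distance divided by the reference length. A minimal sketch of the metric (my own illustrative helper, not code from Moonshine or my port):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] for the current row i.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,               # deletion
                d[j - 1] + 1,           # insertion
                prev_diag + (r != h),   # substitution (free if the words match)
            )
    return d[-1] / max(len(ref), 1)
```

Lower is better: 0.0 means a perfect transcript, and one wrong word in a three-word reference gives 1/3.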

For context, the first version of Moonshine took the Whisper model and optimised away the parts that made latency slow (like the fixed 30s audio window). V2 takes this further, reducing the latency and size even more!

Moonshine is a really cool speech-to-text model optimised for super low latency on low-resource devices like Raspberry Pis and phones. About 100x less latency than Whisper.

Spent some time this weekend working on adding support for word-level timestamps to it: https://github.com/moonshine-ai/moonshine/pull/153

And the de-noise effect on a sample file I use that's full of static.
I've revised the UI a bit to put them side by side instead of all in one big list.

Processing a 3-minute audio clip takes the Python Demucs library about 93s.
The Python https://github.com/ssmall256/demucs-mlx/ port does it in 27s.

And I've managed to get my MLX Swift version down to 16.5s, and ran a similar benchmark comparing its audio output against the other two libraries' results.
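Back-of-the-envelope on those numbers (simple arithmetic on the timings above, nothing more):

```python
clip_seconds = 3 * 60  # the 3-minute test clip

# Timings from the runs above.
timings = {
    "Demucs (Python)": 93.0,
    "demucs-mlx (Python)": 27.0,
    "Swift MLX port": 16.5,
}

for name, secs in timings.items():
    realtime = clip_seconds / secs                  # how many seconds of audio per second of compute
    speedup = timings["Demucs (Python)"] / secs     # relative to the original Python library
    print(f"{name}: {realtime:.1f}x real time, {speedup:.1f}x vs Python Demucs")
```

So the Swift port is running at roughly 10.9x real time, about 5.6x faster than the original Python Demucs.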

Got the basic functionality working. Ported the Demucs Python stemming library https://github.com/adefossez/demucs to Swift MLX. https://github.com/kylehowells/demucs-mlx-swift

Also set up a basic UI to show the different tracks, play them, show waveforms + spectrograms for each track, and allow exporting the audio track files.

The DeepFilterNet Python library vs the MLX version (2.5s vs 1.1s).

However, for my actual app I'm using a Swift + MLX version of the library with custom Metal kernels for some parts of the pipeline, and with small op steps run on the CPU with Accelerate instead of the GPU, to avoid the dispatch delay of sending a small task to the GPU.
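The reasoning behind keeping the small ops on the CPU: every GPU call pays a roughly fixed dispatch cost, so for tiny tensors the round trip costs more than the faster compute saves. A toy cost model of the tradeoff (made-up numbers, purely illustrative, not measurements from my pipeline):

```python
def gpu_cost(n, dispatch_overhead=1.0, per_element=0.001):
    """GPU-like device: fixed dispatch cost per call, fast per-element compute."""
    return dispatch_overhead + n * per_element

def cpu_cost(n, per_element=0.01):
    """CPU-like path (e.g. Accelerate): no dispatch cost, slower per element."""
    return n * per_element

# Crossover size: dispatch + n * 0.001 == n * 0.01  =>  n = dispatch / 0.009.
# Below this, the CPU path wins; above it, the GPU's throughput pays for the dispatch.
crossover = 1.0 / (0.01 - 0.001)
```

With these toy constants the crossover is around 111 elements; the real numbers depend on the hardware and op, but the shape of the tradeoff is the same.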

So I've got the Swift version down to 0.55s for processing a 52s audio file.

I benchmarked it against the original Rust and PyTorch versions to make sure the output quality was the same.

Got the output to match almost perfectly.

Though in terms of pure speed, the Python lib runs highly optimised Rust kernels, so the MLX version processes 10s of audio in 2.5s while the original does it in 1.2s.

Been experimenting with building MLX versions of some of my favourite Python audio-processing ML models and putting together an app to demo/use them.

I have a few projects I want to build them into but using this demo app as a test bed first. #buildinpublic