Sound source localization is an important part of processing audio. We use it to pay attention to someone talking to us in a noisy environment. Smart home speakers use it to identify when someone is speaking, to focus on their voice and reject background noise.
We've built a new system for sound source localization, based on spiking neural networks (SNNs), that sets a new state-of-the-art for SNN implementations, is extremely power efficient, and even matches the accuracy of standard DSP-based approaches! [1] https://arxiv.org/abs/2402.11748
Low-power SNN-based audio source localisation using a Hilbert Transform spike encoding scheme

Sound source localisation is used in many consumer devices, to isolate audio from individual speakers and reject noise. Localization is frequently accomplished by ``beamforming'', which combines phase-shifted audio streams to increase power from chosen source directions, under a known microphone array geometry. Dense band-pass filters are often needed to obtain narrowband signal components from wideband audio. These approaches achieve high accuracy, but narrowband beamforming is computationally demanding, and not ideal for low-power IoT devices. We demonstrate a novel method for sound source localisation on arbitrary microphone arrays, designed for efficient implementation in ultra-low-power spiking neural networks (SNNs). We use a Hilbert transform to avoid dense band-pass filters, and introduce a new event-based encoding method that captures the phase of the complex analytic signal. Our approach achieves state-of-the-art accuracy for SNN methods, comparable with traditional non-SNN super-resolution beamforming. We deploy our method to low-power SNN inference hardware, with much lower power consumption than super-resolution methods. We demonstrate that signal processing approaches co-designed with spiking neural network implementations can achieve much improved power efficiency. Our new Hilbert-transform-based method for beamforming can also improve the efficiency of traditional DSP-based signal processing.

Mammals exploit the fact that sound arriving from different directions produces precise differences in arrival time between the two ears, known as interaural time differences (ITDs). ITDs are encoded by the differences in neuronal spike times produced by the two cochleas. [2]
Most SNN implementations of sound source localization take this approach, using the precise differences in spike times generated by a single-frequency sound at two microphones to estimate the location of an audio source.
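As a rough illustration of the classical two-microphone idea (not the SNN pipeline itself), the arrival-time difference can be recovered by cross-correlating the two microphone signals and converting the best lag into an angle. The sample rate, microphone spacing, and source angle below are made-up values for the sketch:

```python
import numpy as np

# Hypothetical two-microphone setup (spacing, sample rate are assumptions).
fs = 16_000          # sample rate (Hz)
mic_spacing = 0.1    # metres between the two microphones
c = 343.0            # speed of sound (m/s)

# Simulate a source at 30 degrees: mic 2 receives a delayed copy of mic 1.
true_angle = np.deg2rad(30.0)
delay_n = int(round(mic_spacing * np.sin(true_angle) / c * fs))

rng = np.random.default_rng(0)
sig = rng.standard_normal(4096)           # wideband "source"
mic1 = sig
mic2 = np.roll(sig, delay_n)              # delayed arrival at the second mic

# Cross-correlate to find the lag (in samples) that best aligns the signals.
lags = np.arange(-63, 64)
xcorr = [np.dot(mic1[64:-64], mic2[64 + k:len(mic2) - 64 + k]) for k in lags]
best_lag = lags[int(np.argmax(xcorr))]

# Convert the winning lag back into an estimated arrival angle.
est_angle = np.rad2deg(np.arcsin(best_lag / fs * c / mic_spacing))
```

Note that at 16 kHz the lag is quantised to whole samples, so the recovered angle is only approximate; this coarseness is one reason sub-sample (spike-timing or super-resolution) methods are attractive.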
We took a different approach, designed for arrays with many microphones (>2). We use a construct called the Hilbert Transform to estimate the phase of each microphone signal, and encode those phases as spikes. We then use a beamforming method to estimate the source direction.
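A minimal sketch of that first step: the analytic signal can be built with an FFT (this construction is equivalent to `scipy.signal.hilbert`), and its angle gives the instantaneous phase. The phase-crossing spike rule below is an illustrative assumption, not the exact event encoding from the paper:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via FFT, equivalent to scipy.signal.hilbert."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)                # one-sided spectral weighting
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.fft.ifft(spec * h)

fs = 16_000
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 440.0 * t)         # toy single-tone input

z = analytic_signal(x)
phase = np.angle(z)                       # instantaneous phase in (-pi, pi]

# Illustrative event rule (an assumption, not the paper's exact scheme):
# emit one spike each time the unwrapped phase passes a multiple of 2*pi,
# so the spike train tracks the signal's cycles.
unwrapped = np.unwrap(phase)
spikes = np.flatnonzero(np.diff(np.floor(unwrapped / (2 * np.pi))) > 0)
```

The real part of `z` is the original signal, and the spike count tracks the number of cycles in the window, so the timing of the spikes carries the phase information the beamformer needs.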

The major benefit of our approach is that it works for *any* signal, e.g. wideband speech, and not just narrowband sine waves.

Beamforming works by "steering" the microphone array towards a chosen direction, by combining the audio signal from each microphone. Usually this is done by assuming a particular frequency for the source signal (the "narrowband" regime).

By using the Hilbert Transform we developed a single beamforming approach that works well in the narrowband case, and can use all frequencies of a wideband signal to work well in the wideband case!
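To make the "steering" idea concrete, here is a toy narrowband delay-and-sum scan over candidate directions for a uniform linear array. This is the standard textbook beamformer described above, not our Hilbert-based method, and the geometry and tone frequency are assumptions for illustration:

```python
import numpy as np

# Toy narrowband delay-and-sum scan (standard beamforming, assumed geometry).
fs, f0, c = 16_000, 1_000.0, 343.0
n_mics, spacing = 4, 0.05                      # uniform linear array, 5 cm pitch
true_angle = np.deg2rad(20.0)

t = np.arange(2048) / fs
# Plane wave from true_angle: each mic sees the tone with its own delay.
delays = np.arange(n_mics) * spacing * np.sin(true_angle) / c
mics = np.exp(2j * np.pi * f0 * (t[None, :] - delays[:, None]))

# Scan candidate directions: undo each hypothesised per-mic delay with a
# phase shift, sum across mics, and record the output power.
angles = np.deg2rad(np.arange(-90.0, 91.0))
powers = []
for a in angles:
    steer = np.exp(2j * np.pi * f0 * np.arange(n_mics) * spacing * np.sin(a) / c)
    beam = (steer[:, None] * mics).sum(axis=0)
    powers.append(np.mean(np.abs(beam) ** 2))

est_deg = np.rad2deg(angles[int(np.argmax(powers))])   # peaks at the true angle
```

The scan peaks when the hypothesised direction matches the true one. The narrowband assumption is baked into the single steering frequency `f0`; a wideband source would need this repeated per frequency band, which is exactly the cost our Hilbert-based approach avoids.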
As a result, we use *far* fewer implementation resources for beamforming than standard approaches. Using an SNN also makes us very power efficient, while still achieving state-of-the-art accuracy for SNNs, comparable with standard super-resolution methods such as MUSIC [3]!

If you're interested, you can read more details in our preprint on arXiv: https://arxiv.org/abs/2402.11748

And of course, our code is available open source: https://github.com/synsense/HaghighatshoarMuir2024
