Spiking silicon cochlea sensors encode sound as an asynchronous stream of
spikes from different frequency channels. The lack of labeled training datasets
for spiking cochleas makes it difficult to train deep neural networks on the
outputs of these sensors. This work proposes a self-supervised method, the
Temporal Network Grafting Algorithm (T-NGA), which grafts a recurrent network
pretrained on spectrogram features so that it operates on cochlea event
features instead. T-NGA training requires only temporally aligned audio
spectrograms and event features. Our experiments show that the accuracy of the
grafted network was similar to the accuracy of a supervised network trained
from scratch on a speech recognition task using events from a software spiking
cochlea model. Despite the circuit non-idealities of the spiking silicon
cochlea, the grafted network accuracy on the silicon cochlea spike recordings
was only about 5% lower than the supervised network accuracy using the
N-TIDIGITS18 dataset. T-NGA can train networks to process spiking audio sensor
events in the absence of large labeled spike datasets.

Comment: 5 pages, 4 figures; accepted at IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Singapore, 202
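
The grafting idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the training objective, so the sketch assumes a mean-squared feature-matching loss between a frozen pretrained front end and a trainable grafted front end. The layer sizes, the synthetic data, the single tanh layer standing in for the recurrent network's input stage, and the names `W_pre`, `W_graft`, and `graft_loss` are all hypothetical. What it does reflect from the abstract is that training uses only temporally aligned spectrogram and event features, with no labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames, spectrogram bins, cochlea channels, hidden units.
T, n_spec, n_event, n_hidden = 200, 40, 64, 16

# Synthetic stand-ins for temporally aligned, unlabeled feature pairs.
spec = rng.normal(size=(T, n_spec))
mix = rng.normal(size=(n_spec, n_event)) / np.sqrt(n_spec)
events = spec @ mix  # row t of `events` is aligned with row t of `spec`

# Frozen front end of the "pretrained" network (assumed: one tanh layer).
W_pre = rng.normal(size=(n_spec, n_hidden)) / np.sqrt(n_spec)
target = np.tanh(spec @ W_pre)  # hidden features the grafted layer should match

# Trainable grafted front end that reads event features instead.
W_graft = np.zeros((n_event, n_hidden))


def graft_loss(W):
    """Assumed objective: mean-squared error between hidden features."""
    return float(np.mean((np.tanh(events @ W) - target) ** 2))


initial_loss = graft_loss(W_graft)

lr = 0.2
for _ in range(1000):
    out = np.tanh(events @ W_graft)
    err = out - target
    # Gradient of the mean-squared error through the tanh nonlinearity.
    grad = events.T @ (err * (1.0 - out**2)) * (2.0 / err.size)
    W_graft -= lr * grad

final_loss = graft_loss(W_graft)
print(f"feature-matching loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

In this sketch the rest of the pretrained network would stay frozen and be reused unchanged on top of the grafted front end, which is why no labels enter the loop.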