Most existing neural models for keyword spotting (KWS) in smart
devices require thousands of training samples to learn a decent audio
representation. However, with the rising demand for smart devices to become
more personalized, KWS models need to adapt quickly from only a few user samples. To
tackle this challenge, we propose a contrastive speech mixup (CosMix) learning
algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss
to the existing mixup augmentation technique to maximize the relative
similarity between the original pre-mixed samples and the augmented samples.
The goal is to impose additional constraints that guide the model towards simpler
but richer content-based speech representations from two augmented views (i.e.,
noisy mixed and clean pre-mixed utterances). We conduct our experiments on the
Google Speech Command dataset, where we trim the size of the training set to as
small as 2.5 minutes per keyword to simulate a low-resource condition. Our
experimental results show a consistent improvement in the performance of
multiple models, demonstrating the effectiveness of our method.

Comment: Accepted by ICASSP 202
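
To make the idea of pairing mixup with an auxiliary contrastive loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the exact loss, so an InfoNCE-style objective is assumed here, in which each mixed sample is pulled towards its two pre-mixed sources with weights lam and 1 - lam. The function names, the linear "encoder" W, the temperature, and the Beta(0.4, 0.4) mixing distribution are all illustrative assumptions.

```python
import numpy as np

def mixup(x, lam, perm):
    # Standard mixup: convex combination of each sample with a partner chosen by `perm`.
    return lam * x + (1.0 - lam) * x[perm]

def l2_normalize(z, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def mix_contrastive_loss(z_mix, z_clean, lam, perm, temperature=0.1):
    # ASSUMED loss form (InfoNCE-style); the abstract does not specify it.
    # Row i of z_mix is attracted to its two clean sources, rows i and perm[i]
    # of z_clean, relative to all other clean samples in the batch.
    z_mix, z_clean = l2_normalize(z_mix), l2_normalize(z_clean)
    logits = z_mix @ z_clean.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z_mix))
    return -np.mean(lam * log_prob[idx, idx] + (1.0 - lam) * log_prob[idx, perm])

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 40))     # stand-in for 8 utterance feature vectors
W = rng.normal(size=(40, 16))    # hypothetical linear "encoder" for illustration
lam = rng.beta(0.4, 0.4)         # mixup coefficient (assumed distribution)
perm = rng.permutation(len(x))   # which sample each utterance is mixed with

x_mix = mixup(x, lam, perm)                               # noisy mixed view
loss = mix_contrastive_loss(x_mix @ W, x @ W, lam, perm)  # compare with clean view
print(float(loss))
```

In a training loop, a term like this would presumably be added to the usual keyword classification loss with some weighting factor; that weighting is likewise an assumption and is not stated in the abstract.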