Keyword Spotting (KWS) models on embedded devices should adapt quickly to new
user-defined words without forgetting previously learned ones. Embedded devices
have limited storage and computational resources and thus can neither store
samples nor update large models. We consider the setup of embedded online
continual learning (EOCL), where KWS models with a frozen backbone are trained
to incrementally recognize new words from a non-repeating stream of samples, seen
one at a time. To this end, we propose Temporal Aware Pooling (TAP), which
constructs an enriched feature space by computing high-order moments of speech
features extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a
Gaussian model for each class on the enriched feature space to make effective use of
audio representations. In experimental analyses, TAP-SLDA outperforms
competitors on several setups, backbones, and baselines, bringing a relative
average gain of 11.3% on the GSC dataset.

Comment: INTERSPEECH 202
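The high-order moment pooling that TAP performs can be illustrated as follows. This is a minimal sketch under assumptions not stated in the abstract: the function name, the choice of moments (mean, standard deviation, then standardized moments up to a given order), and the concatenation layout are illustrative, not the authors' implementation.

```python
import numpy as np

def temporal_moment_pooling(features, order=4, eps=1e-8):
    """Pool frame-level features of shape (T, D) into one vector by
    concatenating per-dimension statistics: mean, std, and standardized
    moments of degree 3..order (skewness, kurtosis, ...)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    pooled = [mu, sigma]
    z = (features - mu) / (sigma + eps)  # standardize each dimension over time
    for k in range(3, order + 1):
        pooled.append((z ** k).mean(axis=0))  # k-th standardized moment
    return np.concatenate(pooled)  # shape: (order * D,)

# Example: pool 40 frames of 64-dim backbone features into a 256-dim vector
X = np.random.randn(40, 64)
v = temporal_moment_pooling(X, order=4)
```

The enriched vector grows linearly with the moment order, so the per-class Gaussian models operate on an `order * D`-dimensional space rather than `D`.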
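The streaming Gaussian-per-class classification in the SLDA family can be sketched as below: running class means plus a shared running covariance, each updated one sample at a time with no stored samples. The class name, the shrinkage parameter, and the Welford-style covariance update are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

class StreamingGaussianClassifier:
    """SLDA-style streaming classifier: per-class running means and a
    shared running covariance, updated online, one sample at a time."""

    def __init__(self, dim, shrinkage=1e-2):
        self.dim = dim
        self.shrinkage = shrinkage   # regularizer so the covariance is invertible
        self.means = {}              # class id -> running mean vector
        self.counts = {}             # class id -> number of samples seen
        self.cov = np.zeros((dim, dim))
        self.n = 0

    def fit_one(self, x, y):
        """Update the model with a single (feature, label) pair."""
        if y not in self.means:
            self.means[y] = np.zeros(self.dim)
            self.counts[y] = 0
        self.counts[y] += 1
        delta = x - self.means[y]
        self.means[y] += delta / self.counts[y]   # incremental class mean
        self.n += 1
        resid = x - self.means[y]                 # residual w.r.t. updated mean
        self.cov += (np.outer(delta, resid) - self.cov) / self.n

    def predict(self, x):
        """Assign x to the class with the highest linear discriminant score."""
        cov = self.cov + self.shrinkage * np.eye(self.dim)
        prec = np.linalg.inv(cov)
        scores = {c: mu @ prec @ x - 0.5 * (mu @ prec @ mu)
                  for c, mu in self.means.items()}
        return max(scores, key=scores.get)
```

Because only the means and one shared covariance are kept, memory is fixed at roughly `C * D + D * D` floats regardless of how many samples stream past, which matches the embedded constraint of not saving samples.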