Exploring Filterbank Learning for Keyword Spotting
Despite their great performance over the years, handcrafted speech features
are not necessarily optimal for any given speech application.
Consequently, with greater or lesser success, optimal filterbank learning has
been studied for different speech processing tasks. In this paper, we fill in a
gap by exploring filterbank learning for keyword spotting (KWS). Two approaches
are examined: filterbank matrix learning in the power spectral domain and
parameter learning of a psychoacoustically-motivated gammachirp filterbank.
Filterbank parameters are optimized jointly with a modern deep residual neural
network-based KWS back-end. Our experimental results reveal that, in general,
there are no statistically significant differences, in terms of KWS accuracy,
between using a learned filterbank and handcrafted speech features. Thus, while
we conclude that the latter are still a wise choice when using modern KWS
back-ends, we also hypothesize that this could be a symptom of information
redundancy, which opens up new research possibilities in the field of
small-footprint KWS.
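A minimal sketch of the first approach, a filterbank matrix learned in the
power spectral domain jointly with a classifier (PyTorch; the dimensions,
the softplus constraint and the toy back-end are assumptions for
illustration, not the paper's exact configuration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFilterbank(nn.Module):
    """Filterbank matrix learned in the power spectral domain."""
    def __init__(self, n_fft_bins=257, n_channels=40):
        super().__init__()
        # Unconstrained weights; softplus keeps the filter gains non-negative.
        self.weight = nn.Parameter(torch.randn(n_channels, n_fft_bins) * 0.01)

    def forward(self, power_spec):               # (batch, frames, n_fft_bins)
        fbank = F.softplus(self.weight)          # non-negative filters
        energies = power_spec @ fbank.t()        # (batch, frames, n_channels)
        return torch.log(energies + 1e-6)        # log-compressed features

# Joint optimization: the KWS loss backpropagates into the filterbank.
frontend = LearnableFilterbank()
backend = nn.Sequential(nn.Flatten(), nn.Linear(40 * 98, 12))  # toy classifier
optimizer = torch.optim.Adam(
    list(frontend.parameters()) + list(backend.parameters()))

power_spec = torch.rand(8, 98, 257)              # dummy batch of power spectra
loss = F.cross_entropy(backend(frontend(power_spec)),
                       torch.randint(0, 12, (8,)))
loss.backward()
optimizer.step()

Because the filterbank weights sit in the same computation graph as the
back-end, the classification loss shapes the learned filters directly.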
Efficient Keyword Spotting by capturing long-range interactions with Temporal Lambda Networks
Models based on attention mechanisms have shown unprecedented speech
recognition performance. However, they are computationally expensive and
unnecessarily complex for keyword spotting, a task targeted to small-footprint
devices. This work explores the application of Lambda networks, an alternative
framework for capturing long-range interactions without attention, for the
keyword spotting task. We propose a novel ResNet-based model in which the
residual blocks are replaced with temporal Lambda layers. Furthermore, the
proposed architecture is built upon one-dimensional temporal convolutions that
further reduce its complexity. The presented model not only reaches
state-of-the-art accuracy on the Google Speech Commands dataset, but is also
85% and 65% lighter than its Transformer-based (KWT) and convolutional (Res15)
counterparts while being up to 100 times faster. To the best of our knowledge,
this is the first attempt to explore the Lambda framework within the speech
domain; it therefore opens up further research into new interfaces based on
this architecture.
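To make the layer concrete, here is a minimal single-head sketch of a temporal
Lambda layer in PyTorch: a content lambda summarizes the whole sequence into
one linear map, while a "lambda convolution" produces position-dependent maps.
The key dimension and kernel size are illustrative assumptions, not the
paper's configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalLambdaLayer(nn.Module):
    """Single-head 1D lambda layer: content + position lambdas (sketch)."""
    def __init__(self, dim, dim_k=16, r=23):
        super().__init__()
        self.to_q = nn.Linear(dim, dim_k, bias=False)
        self.to_k = nn.Linear(dim, dim_k, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Position lambdas via a convolution over the value sequence, so the
        # layer handles variable-length inputs.
        self.pos_conv = nn.Conv2d(1, dim_k, (1, r), padding=(0, r // 2))

    def forward(self, x):                        # x: (batch, time, dim)
        q = self.to_q(x)                         # (b, n, k)
        k = F.softmax(self.to_k(x), dim=1)       # normalized over time
        v = self.to_v(x)                         # (b, n, v)
        # Content lambda: one (k, v) linear map for the whole sequence.
        lam_c = torch.einsum('bnk,bnv->bkv', k, v)
        y_c = torch.einsum('bnk,bkv->bnv', q, lam_c)
        # Position lambdas: one (k, v) map per time step from local context.
        lam_p = self.pos_conv(v.transpose(1, 2).unsqueeze(1))  # (b, k, v, n)
        y_p = torch.einsum('bnk,bkvn->bnv', q, lam_p)
        return y_c + y_p

x = torch.randn(2, 98, 64)                       # dummy (batch, frames, dim)
print(TemporalLambdaLayer(64)(x).shape)          # torch.Size([2, 98, 64])

Unlike self-attention, no (time x time) attention map is ever materialized,
which is where the speed and memory savings come from.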
Filterbank Learning for Small-Footprint Keyword Spotting Robust to Noise
In the context of keyword spotting (KWS), the replacement of handcrafted
speech features by learnable features has not yielded superior KWS performance.
In this study, we demonstrate that filterbank learning outperforms handcrafted
speech features for KWS whenever the number of filterbank channels is severely
decreased. Reducing the number of channels may cause some drop in KWS
performance, but it also brings a substantial reduction in energy consumption,
which is key when deploying always-on KWS on low-resource devices.
Experimental results on
a noisy version of the Google Speech Commands Dataset show that filterbank
learning adapts to noise characteristics to provide a higher degree of
robustness to noise, especially when dropout is integrated. Thus, switching
from the typically used 40-channel log-Mel features to 8-channel learned features
leads to a relative KWS accuracy loss of only 3.5% while simultaneously
achieving a 6.3x reduction in energy consumption.
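A minimal sketch of such a narrow learned filterbank with dropout (PyTorch;
placing dropout on the filter weights, as well as all dimensions, are
assumptions for illustration rather than the paper's exact setup):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NarrowLearnedFilterbank(nn.Module):
    """8-channel learned filterbank with dropout for noise robustness."""
    def __init__(self, n_fft_bins=257, n_channels=8, p_drop=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(n_channels, n_fft_bins))
        self.p_drop = p_drop

    def forward(self, power_spec):               # (batch, frames, bins)
        fbank = F.relu(self.weight)              # non-negative filter gains
        # Dropout regularizes the learned filter shapes during training;
        # self.training disables it automatically at inference time.
        fbank = F.dropout(fbank, self.p_drop, self.training)
        return torch.log(power_spec @ fbank.t() + 1e-6)

feats = NarrowLearnedFilterbank()(torch.rand(4, 98, 257))
print(feats.shape)                               # torch.Size([4, 98, 8])

With only 8 output channels, the feature tensors and the downstream
computation shrink, which is what enables the energy savings described above.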
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest sustained
growth in the societal use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review of deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview is comprehensive,
covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
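As a concrete illustration of the posterior handling stage mentioned in the
pipeline above, the sketch below implements one classic scheme:
moving-average smoothing of frame-level posteriors followed by a
sliding-window confidence score. The window lengths are illustrative, and
this follows a widely cited deep KWS recipe rather than any single system
from the review:

import numpy as np

def smooth_posteriors(post, w_smooth=30):
    """Moving-average smoothing of frame-level keyword posteriors."""
    out = np.zeros_like(post)                    # post: (frames, keywords)
    for t in range(len(post)):
        out[t] = post[max(0, t - w_smooth + 1):t + 1].mean(axis=0)
    return out

def confidence(post, w_max=100):
    """Sliding-window confidence: geometric mean of per-keyword maxima."""
    smoothed = smooth_posteriors(post)
    n_kw = post.shape[1]
    conf = np.zeros(len(post))
    for t in range(len(post)):
        window = smoothed[max(0, t - w_max + 1):t + 1]
        conf[t] = window.max(axis=0).prod() ** (1.0 / n_kw)
    return conf  # fire a detection when this crosses a tuned threshold

post = np.random.rand(200, 2)                    # dummy posteriors, 2 keywords
print(confidence(post).shape)                    # (200,)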
PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords
This study presents a novel zero-shot user-defined keyword spotting model
that utilizes the audio-phoneme relationship of the keyword to improve
performance. Unlike previous approaches that operate at the utterance level, we
use both utterance- and phoneme-level information. Our proposed method comprises
a two-stream speech encoder architecture, a self-attention-based pattern
extractor, and a phoneme-level detection loss for high performance in various
pronunciation environments. Based on experimental results, our proposed model
outperforms the baseline model and achieves competitive performance compared
with full-shot keyword spotting models. Our proposed model significantly
improves the EER and AUC across all datasets, including familiar words, proper
nouns, and indistinguishable pronunciations, with an average relative
improvement of 67% and 80%, respectively. The implementation code of our
proposed model is available at https://github.com/ncsoft/PhonMatchNet.
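For reference, the EER and AUC reported above can be computed from detection
scores and ground-truth labels with standard ROC machinery. A generic sketch
using scikit-learn (this is not the authors' evaluation code, and it assumes
the usual ROC-based definitions of both metrics):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(labels, scores):
    """EER and AUC for keyword detection scores (label 1 = keyword present)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER is the point where false positive and false negative rates cross.
    idx = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, roc_auc_score(labels, scores)

labels = np.random.randint(0, 2, 1000)           # dummy ground truth
scores = np.random.rand(1000)                    # dummy detector scores
print(eer_and_auc(labels, scores))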