Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes, such as the activation of voice assistants. Prospects suggest
sustained growth in the social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly seek to improve KWS performance and reduce computational
complexity. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview is comprehensive
in scope, offering a thorough analysis of deep KWS systems (covering
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
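The posterior-handling stage mentioned in this overview can be illustrated with a minimal sketch: frame-level keyword posteriors from the acoustic model are smoothed with a moving average and then thresholded to decide whether the keyword fired. The window length and threshold below are illustrative assumptions, not values from any surveyed system.

```python
import numpy as np

def smooth_posteriors(posteriors, w_smooth=5):
    """Moving-average smoothing of per-frame keyword posteriors."""
    posteriors = np.asarray(posteriors, dtype=float)
    smoothed = np.empty_like(posteriors)
    for j in range(len(posteriors)):
        lo = max(0, j - w_smooth + 1)  # causal window over past frames
        smoothed[j] = posteriors[lo:j + 1].mean()
    return smoothed

def detect_keyword(posteriors, threshold=0.5, w_smooth=5):
    """Fire a detection when the smoothed posterior crosses the threshold."""
    return bool(np.any(smooth_posteriors(posteriors, w_smooth) >= threshold))
```

Smoothing suppresses isolated noisy spikes in the posterior stream, so a single high-confidence frame does not trigger a false alarm while a sustained run of confident frames still does.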
Keyword Spotting for Hearing Assistive Devices Robust to External Speakers
Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of
small electronic devices that allow interaction with them via speech. Often,
KWS systems are speaker-independent, which means that any person, user or
not, might trigger them. For applications like KWS for hearing assistive
devices this is unacceptable, as only the user must be allowed to handle them.
In this paper we propose KWS for hearing assistive devices that is robust to
external speakers. A state-of-the-art deep residual network for small-footprint
KWS is regarded as a basis to build upon. By following a multi-task learning
scheme, this system is extended to jointly perform KWS and users'
own-voice/external speaker detection with a negligible increase in the number
of parameters. For experiments, we generate from the Google Speech Commands
Dataset a speech corpus emulating hearing aids as a capturing device. Our
results show that this multi-task deep residual network is able to achieve a
KWS accuracy relative improvement of around 32% with respect to a system that
does not deal with external speakers.
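The multi-task scheme described above can be sketched as a weighted combination of two losses computed on a shared backbone: a keyword-classification loss and an own-voice/external-speaker detection loss. The weighting factor `alpha` is a hypothetical trade-off parameter for illustration, not a value from the paper.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true keyword class."""
    return -np.log(probs[label])

def multitask_loss(kws_probs, kws_label, ov_prob, ov_label, alpha=0.5):
    """Weighted sum of the keyword loss and the own-voice detection loss.

    kws_probs: softmax output over keyword classes (shared backbone).
    ov_prob:   sigmoid output of the own-voice/external-speaker head.
    alpha:     hypothetical trade-off weight between the two tasks.
    """
    kws_loss = cross_entropy(np.asarray(kws_probs, dtype=float), kws_label)
    ov_loss = -(ov_label * np.log(ov_prob)
                + (1 - ov_label) * np.log(1 - ov_prob))
    return alpha * kws_loss + (1 - alpha) * ov_loss
```

Because the second head only adds a small binary-classification loss on features the backbone already computes, the parameter increase over single-task KWS stays negligible, which matches the abstract's claim.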
Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
Keyword spotting (KWS) constitutes a major component of human-technology
interfaces. The goals for KWS are to maximize detection accuracy at a low
false alarm (FA) rate while minimizing footprint size, latency and complexity.
Towards achieving them, we study Convolutional Recurrent Neural Networks
(CRNNs). Inspired by large-scale state-of-the-art speech recognition systems,
we combine the strengths of convolutional layers and recurrent layers to
exploit local structure and long-range context. We analyze the effect of
architecture parameters, and propose training strategies to improve
performance. With only ~230k parameters, our CRNN model yields acceptably low
latency, and achieves 97.71% accuracy at 0.5 FA/hour for 5 dB signal-to-noise
ratio. Comment: Accepted to Interspeech 201
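A footprint budget like the ~230k parameters quoted above can be checked with simple per-layer counting formulas for the building blocks of a CRNN. The layer sizes in the assertions below are hypothetical examples, not the paper's configuration.

```python
def conv2d_params(in_ch, out_ch, kh, kw):
    """2-D convolution: one (kh x kw x in_ch) kernel plus a bias per output channel."""
    return out_ch * (in_ch * kh * kw + 1)

def gru_params(input_size, hidden_size):
    """GRU layer: three gates, each with input weights, recurrent weights and a bias."""
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)

def dense_params(in_dim, out_dim):
    """Fully connected layer: weight matrix plus one bias per output unit."""
    return out_dim * (in_dim + 1)
```

Summing these counts over a candidate architecture before training makes it easy to verify that a convolutional front end, recurrent layers and an output classifier together stay within a small-footprint budget.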