TASE: Task-Aware Speech Enhancement for Wake-Up Word Detection in Voice Assistants
Wake-up word spotting in noisy environments is a critical task for an excellent user experience with voice assistants. Unwanted activation of the device is often due to the presence of noises coming from background conversations, TVs, or other domestic appliances. In this work, we propose the use of a speech enhancement convolutional autoencoder, coupled with on-device keyword spotting, aimed at improving the trigger word detection in noisy environments. The end-to-end system learns by optimizing a linear combination of losses: a reconstruction-based loss, both at the log-mel spectrogram and at the waveform level, as well as a specific task loss that accounts for the cross-entropy error of the keyword spotting detection. We experiment with several neural network classifiers and report that deeply coupling the speech enhancement together with a wake-up word detector, e.g., by jointly training them, significantly improves the performance in the noisiest conditions. Additionally, we introduce a new publicly available speech database recorded for Telefónica's voice assistant, Aura. The OK Aura Wake-up Word Dataset incorporates rich metadata, such as speaker demographics or room conditions, and comprises hard negative examples that were studiously selected to present different levels of phonetic similarity with respect to the trigger words 'OK Aura'. Keywords: speech enhancement; wake-up word; keyword spotting; deep learning; convolutional neural network
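The "linear combination of losses" described in the abstract above can be illustrated with a minimal sketch. The abstract does not state the reconstruction distance or the weights, so the mean squared error, the weight names `w_wave`, `w_mel`, `w_task`, and the function `combined_loss` are all assumptions made for illustration, not the authors' implementation. Signals are plain Python lists, with the log-mel spectrogram flattened to one dimension for brevity.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equal-length sequences (assumed distance)."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def cross_entropy(class_probs, label):
    """Cross-entropy of the predicted class probabilities against the true label."""
    # Clamp to avoid log(0) when the classifier assigns zero probability.
    return -math.log(max(class_probs[label], 1e-12))


def combined_loss(clean_wave, enhanced_wave,
                  clean_logmel, enhanced_logmel,
                  kws_probs, label,
                  w_wave=1.0, w_mel=1.0, w_task=1.0):
    """Weighted sum of waveform, log-mel and task losses (hypothetical weights)."""
    wave_loss = mse(clean_wave, enhanced_wave)        # waveform-level reconstruction
    mel_loss = mse(clean_logmel, enhanced_logmel)     # log-mel-level reconstruction
    task_loss = cross_entropy(kws_probs, label)       # keyword spotting classification
    return w_wave * wave_loss + w_mel * mel_loss + w_task * task_loss
```

In a jointly trained system, a single scalar of this kind would be backpropagated through both the classifier and the enhancement autoencoder, which is one way to read the "deep coupling" the abstract refers to.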
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
Keyword Spotting for Hearing Assistive Devices Robust to External Speakers
Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of
small electronic devices that allow interaction with them via speech. Often,
KWS systems are speaker-independent, which means that any person (user or
not) might trigger them. For applications like KWS for hearing assistive
devices this is unacceptable, as only the user must be allowed to handle them.
In this paper we propose KWS for hearing assistive devices that is robust to
external speakers. A state-of-the-art deep residual network for small-footprint
KWS is regarded as a basis to build upon. By following a multi-task learning
scheme, this system is extended to jointly perform KWS and users'
own-voice/external speaker detection with a negligible increase in the number
of parameters. For experiments, we generate from the Google Speech Commands
Dataset a speech corpus emulating hearing aids as a capturing device. Our
results show that this multi-task deep residual network is able to achieve a
KWS accuracy relative improvement of around 32% with respect to a system that
does not deal with external speakers.
iPhonMatchNet: Zero-Shot User-Defined Keyword Spotting Using Implicit Acoustic Echo Cancellation
In response to the increasing interest in human-machine communication across
various domains, this paper introduces a novel approach called iPhonMatchNet,
which addresses the challenge of barge-in scenarios, wherein user speech
overlaps with device playback audio, thereby creating a self-referencing
problem. The proposed model leverages implicit acoustic echo cancellation
(iAEC) techniques to increase the efficiency of user-defined keyword spotting
models, achieving a remarkable 95% reduction in mean absolute error with a
minimal increase in model size (0.13%) compared to the baseline model,
PhonMatchNet. We also present an efficient model structure and demonstrate its
capability to learn iAEC functionality without requiring a clean signal. The
findings of our study indicate that the proposed model achieves competitive
performance in real-world deployment conditions of smart devices. Comment: Submitted to ICASSP 202
Activating Digital Voice Assistant Without Activation Keyword
A digital voice assistant understands and responds to natural language voice commands. To activate such a voice assistant, the user is required to utter an activation keyword. Such explicit activation requires the user to interrupt the natural flow of conversation, e.g., with others in the household. Many households are multilingual, e.g., where members of the household speak with each other in a particular language, but are capable of speaking other languages. This disclosure enables automatic activation of a digital voice assistant upon detection of a specific spoken language. For example, if the household members speak to each other in Polish, but are also capable of speaking English, the voice assistant can be set up to automatically detect spoken English and determine that a command was issued. The described techniques are applicable in any situation where users converse in a language different from the activation language of the voice assistant.
- …