Search CORE

1,677 research outputs found

TASE: Task-Aware Speech Enhancement for Wake-Up Word Detection in Voice Assistants

Author: Bonet David
Cámbara Guillermo
Farrús Mireia
Gómez Pablo
Luque Jordi
López Fernando
Segura Carlos
Publication venue: 'MDPI AG'
Publication date: 09/03/2022

Wake-up word spotting in noisy environments is a critical task for an excellent user experience with voice assistants. Unwanted activation of the device is often due to the presence of noises coming from background conversations, TVs, or other domestic appliances. In this work, we propose the use of a speech enhancement convolutional autoencoder, coupled with on-device keyword spotting, aimed at improving the trigger word detection in noisy environments. The end-to-end system learns by optimizing a linear combination of losses: a reconstruction-based loss, both at the log-mel spectrogram and at the waveform level, as well as a specific task loss that accounts for the cross-entropy error reported along the keyword spotting detection. We experiment with several neural network classifiers and report that deeply coupling the speech enhancement together with a wake-up word detector, e.g., by jointly training them, significantly improves the performance in the noisiest conditions. Additionally, we introduce a new publicly available speech database recorded for Telefónica's voice assistant, Aura. The OK Aura Wake-up Word Dataset incorporates rich metadata, such as speaker demographics or room conditions, and comprises hard negative examples that were studiously selected to present different levels of phonetic similarity with respect to the trigger words 'OK Aura'. Keywords: speech enhancement; wake-up word; keyword spotting; deep learning; convolutional neural network
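
The "linear combination of losses" mentioned in this abstract could be sketched roughly as below. This is only an illustration under stated assumptions: the abstract does not say which distance is used for reconstruction or how the terms are weighted, so the L1 distance, the weights `w_wave`, `w_mel`, `w_task`, and all function names here are hypothetical, and the log-mel features are assumed to be precomputed and flattened into plain lists.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences (assumed distance)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def cross_entropy(probs, label):
    """Negative log-probability the keyword classifier assigns to the true class."""
    return -math.log(max(probs[label], 1e-12))

def combined_loss(enh_wave, clean_wave, enh_logmel, clean_logmel,
                  kws_probs, kws_label,
                  w_wave=1.0, w_mel=1.0, w_task=1.0):
    """Weighted sum of waveform, log-mel and task losses.

    The weights are hypothetical; the abstract only states that the
    terms are combined linearly.
    """
    wave_term = l1_loss(enh_wave, clean_wave)      # waveform-level reconstruction
    mel_term = l1_loss(enh_logmel, clean_logmel)   # log-mel-level reconstruction
    task_term = cross_entropy(kws_probs, kws_label)  # keyword spotting error
    return w_wave * wave_term + w_mel * mel_term + w_task * task_term
```

In a jointly trained system, the gradient of the task term would flow back through the classifier into the enhancement network, which is presumably what "deeply coupling" the two refers to.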

Diposit Digital de la Universitat de Barcelona

Deep Spoken Keyword Spotting: An Overview

Author: López-Espejo Iván
Hansen John
Jensen Jesper
Tan Zheng-Hua
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 20/11/2021

Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

arXiv.org e-Print Archive

VBN

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Author: Jensen Jesper
López-Espejo Iván
Tan Zheng-Hua
Publication date: 26/06/2019

Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of small electronic devices that allow interaction with them via speech. Often, KWS systems are speaker-independent, which means that any person (user or not) might trigger them. For applications like KWS for hearing assistive devices this is unacceptable, as only the user must be allowed to handle them. In this paper we propose KWS for hearing assistive devices that is robust to external speakers. A state-of-the-art deep residual network for small-footprint KWS is regarded as a basis to build upon. By following a multi-task learning scheme, this system is extended to jointly perform KWS and users' own-voice/external speaker detection with a negligible increase in the number of parameters. For experiments, we generate from the Google Speech Commands Dataset a speech corpus emulating hearing aids as a capturing device. Our results show that this multi-task deep residual network is able to achieve a KWS accuracy relative improvement of around 32% with respect to a system that does not deal with external speakers.
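
The multi-task scheme described in this abstract, one network producing both a keyword prediction and an own-voice/external-speaker prediction, might be trained with an objective along the following lines. This is a minimal sketch, not the authors' formulation: the abstract does not give the loss, so the single sigmoid output for own-voice detection, the `weight` parameter, and the function names are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multitask_loss(kws_logits, kws_label, own_voice_logit, is_own_voice,
                   weight=1.0):
    """Keyword cross-entropy plus a weighted own-voice binary cross-entropy.

    Both logits are assumed to come from heads that share one trunk;
    `weight` is a hypothetical balancing factor.
    """
    eps = 1e-12
    # Task 1: which keyword (if any) was spoken.
    kws_probs = softmax(kws_logits)
    kws_term = -math.log(max(kws_probs[kws_label], eps))
    # Task 2: did the device user (own voice) or an external speaker talk?
    p_own = 1.0 / (1.0 + math.exp(-own_voice_logit))
    p_own = min(max(p_own, eps), 1.0 - eps)
    target = 1.0 if is_own_voice else 0.0
    spk_term = -(target * math.log(p_own) + (1.0 - target) * math.log(1.0 - p_own))
    return kws_term + weight * spk_term
```

Because the second head would add only a small output layer on top of shared features, a design like this is consistent with the "negligible increase in the number of parameters" the abstract reports.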

arXiv.org e-Print Archive

Crossref

VBN

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Author: Jensen Jesper
Lopez-Espejo Ivan
Tan Zheng-Hua
Publication venue: 'International Speech Communication Association'
Publication date: 01/09/2019

Crossref

VBN

iPhonMatchNet: Zero-Shot User-Defined Keyword Spotting Using Implicit Acoustic Echo Cancellation

Author: Cho Namhyun
Lee Yong-Hyeok
Publication date: 13/09/2023

In response to the increasing interest in human-machine communication across various domains, this paper introduces a novel approach called iPhonMatchNet, which addresses the challenge of barge-in scenarios, wherein user speech overlaps with device playback audio, thereby creating a self-referencing problem. The proposed model leverages implicit acoustic echo cancellation (iAEC) techniques to increase the efficiency of user-defined keyword spotting models, achieving a remarkable 95% reduction in mean absolute error with a minimal increase in model size (0.13%) compared to the baseline model, PhonMatchNet. We also present an efficient model structure and demonstrate its capability to learn iAEC functionality without requiring a clean signal. The findings of our study indicate that the proposed model achieves competitive performance in real-world deployment conditions of smart devices. Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

Activating Digital Voice Assistant Without Activation Keyword

Author: Slowik Krzysztof
Publication venue: Technical Disclosure Commons
Publication date: 04/08/2020

A digital voice assistant understands and responds to natural language voice commands. To activate such a voice assistant, the user is required to utter an activation keyword. Such explicit activation requires the user to interrupt the natural flow of conversation, e.g., with others in the household. Many households are multilingual, e.g., where members of the household speak with each other in a particular language, but are capable of speaking other languages. This disclosure enables automatic activation of a digital voice assistant upon detection of a specific spoken language. For example, if the household members speak to each other in Polish, but are also capable of speaking English, the voice assistant can be set up to automatically detect spoken English and determine that a command was issued. The described techniques are applicable in any situation where users converse in a language different from the activation language of the voice assistant.

Technical Disclosure Commons