Deep Residual Learning for Small-Footprint Keyword Spotting
We explore the application of deep residual learning and dilated convolutions
to the keyword spotting task, using the recently-released Google Speech
Commands Dataset as our benchmark. Our best residual network (ResNet)
implementation significantly outperforms Google's previous convolutional neural
networks in terms of accuracy. By varying model depth and width, we can achieve
compact models that also outperform previous small-footprint variants. To our
knowledge, we are the first to examine these approaches for keyword spotting,
and our results establish an open-source state-of-the-art reference to support
the development of future speech-based interfaces.
Comment: Published in ICASSP 201
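The core building block the abstract names, a residual unit built from dilated convolutions, can be sketched in NumPy. This is an illustrative toy under assumed shapes ("same"-padded convolution over time, equal input/output channel counts), not the paper's implementation:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded dilated 1D convolution over the time axis.

    x: (time, channels_in) feature map (e.g. log-Mel frames),
    w: (kernel, channels_in, channels_out) filter bank.
    """
    k, c_in, c_out = w.shape
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], c_out))
    for t in range(x.shape[0]):
        for j in range(k):
            # Each tap looks `dilation` frames apart, widening the
            # receptive field without adding parameters.
            out[t] += xp[t + j * dilation] @ w[j]
    return out

def residual_block(x, w, dilation):
    """Identity shortcut around a dilated convolution + ReLU."""
    return x + np.maximum(dilated_conv1d(x, w, dilation), 0.0)
```

The identity shortcut is what lets such networks grow deep while staying small-footprint: depth and width can be varied independently of the skip path.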
Keyword Spotting for Hearing Assistive Devices Robust to External Speakers
Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of
small electronic devices that allow interaction with them via speech. Often,
KWS systems are speaker-independent, meaning that any person, user or not,
might trigger them. For applications such as KWS for hearing assistive
devices this is unacceptable, since only the user should be able to operate them.
In this paper we propose KWS for hearing assistive devices that is robust to
external speakers. A state-of-the-art deep residual network for small-footprint
KWS is regarded as a basis to build upon. By following a multi-task learning
scheme, this system is extended to jointly perform KWS and users'
own-voice/external speaker detection with a negligible increase in the number
of parameters. For the experiments, we generate, from the Google Speech
Commands Dataset, a speech corpus that emulates hearing aids as the capturing
device. Our results show that this multi-task deep residual network achieves a
relative improvement in KWS accuracy of around 32% over a system that does not
deal with external speakers.
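One way to picture the multi-task extension, purely as an assumed sketch rather than the authors' architecture: a shared embedding from the residual trunk feeds two lightweight linear heads (keyword classifier and own-voice detector), and their losses are combined with a trade-off weight `alpha`. All dimensions and the weighting scheme here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64-dim embedding from the shared residual trunk,
# 12 keyword classes, and one own-voice/external-speaker logit.
EMB, N_KEYWORDS = 64, 12
W_kws = rng.standard_normal((EMB, N_KEYWORDS)) * 0.01
W_ovd = rng.standard_normal(EMB) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def heads(embedding):
    """Two lightweight heads on one shared embedding:
    keyword posterior (softmax) and own-voice probability (sigmoid)."""
    p_kws = softmax(embedding @ W_kws)
    p_own = 1.0 / (1.0 + np.exp(-(embedding @ W_ovd)))
    return p_kws, p_own

def joint_loss(p_kws, kw_label, p_own, own_label, alpha=0.5):
    """Weighted sum of keyword cross-entropy and own-voice binary
    cross-entropy; `alpha` is a hypothetical trade-off weight."""
    ce_kws = -np.log(p_kws[kw_label])
    ce_own = -(own_label * np.log(p_own) + (1 - own_label) * np.log(1 - p_own))
    return (1 - alpha) * ce_kws + alpha * ce_own
```

Because both heads are single linear layers on a shared embedding, the parameter overhead relative to the trunk is negligible, which matches the abstract's claim.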