Keyword Spotting for Hearing Assistive Devices Robust to External Speakers
Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of
small electronic devices that allow interaction with them via speech. Often,
KWS systems are speaker-independent, which means that any person, user or
not, might trigger them. For applications like KWS for hearing assistive
devices this is unacceptable, as only the user should be allowed to operate them.
In this paper we propose KWS for hearing assistive devices that is robust to
external speakers. A state-of-the-art deep residual network for small-footprint
KWS serves as the basis to build upon. By following a multi-task learning
scheme, this system is extended to jointly perform KWS and users'
own-voice/external speaker detection with a negligible increase in the number
of parameters. For experiments, we generate from the Google Speech Commands
Dataset a speech corpus emulating hearing aids as a capturing device. Our
results show that this multi-task deep residual network achieves a relative
improvement in KWS accuracy of around 32% with respect to a system that
does not deal with external speakers.
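
The multi-task idea described above can be illustrated with a minimal sketch: a shared embedding (standing in for the output of the residual network trunk) feeds two linear heads, one for keywords and one for own-voice/external speaker detection, and the two cross-entropy losses are summed. The class `MultiTaskHeads`, the function `joint_loss`, the loss weight, and all sizes are hypothetical and chosen for illustration only; the abstract does not specify the architecture or the loss weighting.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of integer labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


class MultiTaskHeads:
    """Two linear heads on one shared embedding (hypothetical sizes)."""

    def __init__(self, embed_dim, n_keywords, rng):
        self.w_kws = rng.normal(0.0, 0.1, (embed_dim, n_keywords))
        self.b_kws = np.zeros(n_keywords)
        # Binary head: own voice vs. external speaker.
        self.w_spk = rng.normal(0.0, 0.1, (embed_dim, 2))
        self.b_spk = np.zeros(2)

    def forward(self, shared):
        p_kws = softmax(shared @ self.w_kws + self.b_kws)
        p_spk = softmax(shared @ self.w_spk + self.b_spk)
        return p_kws, p_spk

    def extra_parameters(self):
        """Parameters added by the second task on top of the KWS model."""
        return self.w_spk.size + self.b_spk.size


def joint_loss(heads, shared, kws_labels, spk_labels, weight=1.0):
    """Sum of both task losses; `weight` is an assumed hyperparameter."""
    p_kws, p_spk = heads.forward(shared)
    return cross_entropy(p_kws, kws_labels) + weight * cross_entropy(p_spk, spk_labels)
```

Because the second head is a single linear layer on an embedding the trunk already computes, it adds only `2 * embed_dim + 2` parameters in this sketch, which is one plausible way a joint system could keep the parameter increase small.
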
Exploring Filterbank Learning for Keyword Spotting
Despite their great performance over the years, handcrafted speech features
are not necessarily optimal for any particular speech application.
Consequently, with greater or lesser success, optimal filterbank learning has
been studied for different speech processing tasks. In this paper, we fill in a
gap by exploring filterbank learning for keyword spotting (KWS). Two approaches
are examined: filterbank matrix learning in the power spectral domain and
parameter learning of a psychoacoustically-motivated gammachirp filterbank.
Filterbank parameters are optimized jointly with a modern deep residual neural
network-based KWS back-end. Our experimental results reveal that, in general,
there are no statistically significant differences, in terms of KWS accuracy,
between using a learned filterbank and handcrafted speech features. Thus, while
we conclude that the latter are still a wise choice when using modern KWS
back-ends, we also hypothesize that this could be a symptom of information
redundancy, which opens up new research possibilities in the field of
small-footprint KWS.
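
As a rough illustration of the first approach, filterbank matrix learning in the power spectral domain, the sketch below treats the filterbank as a weight matrix applied to a power spectrogram, followed by log compression. Initializing the matrix from a triangular Mel filterbank and clipping weights to be non-negative are assumptions made here for illustration, not details taken from the abstract; the helper names are hypothetical. In a real system the matrix would be updated by backpropagation together with the KWS back-end, and the gammachirp approach is not sketched.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filterbank of shape (n_fft // 2 + 1, n_filters)."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_bins, n_filters))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[:, i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def filterbank_features(power_spec, weights, eps=1e-6):
    """Apply a (possibly learned) filterbank matrix and log-compress.

    power_spec: (n_frames, n_bins) power spectrogram.
    weights:    (n_bins, n_filters) trainable matrix; clipped to be
                non-negative here (an assumption of this sketch).
    """
    w = np.maximum(weights, 0.0)
    return np.log(power_spec @ w + eps)
```

With `weights` frozen at the output of `mel_filterbank`, `filterbank_features` reduces to ordinary log-Mel features, which makes the handcrafted baseline and the learned variant directly comparable in the same pipeline.
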