121 research outputs found
Accurate Detection of Wake Word Start and End Using a CNN
Small-footprint embedded devices require keyword spotters (KWS) with small
model size and low detection latency to enable voice assistants. Such a
keyword is often referred to as a wake word, as it is used to wake up
voice-assistant-enabled devices. Together with wake word detection, accurate
estimation of wake word endpoints (start and end) is an important task of KWS.
In this paper, we propose two new methods for detecting the endpoints of wake
words in neural KWS that use single-stage word-level neural networks. Our
results show that the new techniques detect wake word endpoints with a
standard error of up to 50 msec against human annotations, on par with
conventional Acoustic Model plus HMM forced alignment. To our knowledge, this
is the first study of wake word endpoint detection methods for single-stage
neural KWS.
Comment: Proceedings of INTERSPEECH
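The abstract does not spell out the two proposed methods. As a point of reference only, here is a minimal sketch (not the paper's method) of the naive way to read endpoints off the per-frame wake-word posteriors of a word-level network, assuming a fixed decision threshold and a 10 msec frame shift:

    import numpy as np

    def endpoints_from_posteriors(post, threshold=0.5, frame_shift_ms=10):
        """Crude endpoint estimate: the longest contiguous run of frames
        whose wake-word posterior exceeds `threshold`. Returns
        (start_ms, end_ms), or None when the keyword never fires."""
        active = np.concatenate(([0], (post >= threshold).astype(int), [0]))
        diff = np.diff(active)
        starts = np.flatnonzero(diff == 1)   # first frame of each run
        ends = np.flatnonzero(diff == -1)    # exclusive end of each run
        if starts.size == 0:
            return None
        longest = np.argmax(ends - starts)
        return starts[longest] * frame_shift_ms, ends[longest] * frame_shift_ms

    # endpoints_from_posteriors(np.array([0.1, 0.7, 0.9, 0.8, 0.2]))  # (10, 40)

Methods like the paper's are evaluated precisely because such naive thresholding is sensitive to the choice of threshold and to posterior smoothing.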
Spoken command recognition for robotics
In this thesis, I investigate spoken command recognition technology for robotics. While high
robustness is expected, the distant and noisy conditions in which the system has to operate
make the task very challenging. Unlike commercial systems which all rely on a "wake-up"
word to initiate the interaction, the pipeline proposed here directly detects and recognizes
commands from the continuous audio stream. In order to keep the task manageable despite
low-resource conditions, I propose to focus on a limited set of commands, thus trading off
flexibility of the system against robustness.
Domain and speaker adaptation strategies based on a multi-task regularization paradigm
are first explored. More precisely, two different methods are proposed, both relying on a tied
loss function that penalizes the distance between the outputs of several networks. The first
method considers each speaker or domain as a task. A canonical task-independent network is
jointly trained with task-dependent models, allowing both types of networks to improve by
learning from one another. While an improvement of 3.2% in frame error rate (FER) is
obtained for the task-independent network, this only partially carries over to the phone error
rate (PER), which improves by 1.5%. Similarly, a second method explores the parallel
training of the canonical network with a privileged model that has access to i-vectors. This
method proves less effective, with only a 1.2% improvement in FER.
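As a rough illustration of the tied-loss idea described above, here is a minimal sketch in PyTorch; the squared distance between output distributions and the tying weight are assumptions, since the abstract does not specify the exact distance measure:

    import torch
    import torch.nn.functional as F

    def tied_loss(canonical_logits, task_logits, targets, lam=0.1):
        """Tied loss sketch: the canonical (task-independent) network and a
        task-dependent network are both trained on the same frame labels,
        while a penalty term ties their output distributions together.
        `lam` is an assumed hyper-parameter, not a value from the thesis."""
        ce = (F.cross_entropy(canonical_logits, targets)
              + F.cross_entropy(task_logits, targets))
        tie = F.mse_loss(F.softmax(canonical_logits, dim=-1),
                         F.softmax(task_logits, dim=-1))
        return ce + lam * tie

    # One step for a batch of frames from speaker/domain `t` (shapes assumed):
    # loss = tied_loss(canonical(x), task_nets[t](x), y); loss.backward()

Because the penalty is symmetric, gradients flow into both networks, which is what lets the canonical and task-dependent models improve by learning from one another.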
In order to make the developed technology more accessible, I also investigated the use
of a sequence-to-sequence (S2S) architecture for command classification. The use of an
attention-based encoder-decoder model reduced the classification error by 40% relative to a
strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing
the relevance of S2S architectures in this context. To improve the flexibility of the
trained system, I also explored strategies for few-shot learning, which make it possible to
extend the set of commands with minimal data requirements. Retraining a model on the
combination of original and new commands, I achieved 40.5% accuracy on the new
commands with only 10 examples of each. This score goes up to 81.5% with a larger set of
100 examples per new command. An alternative strategy, based on model adaptation,
achieved even better scores (68.8% and 88.4% accuracy with 10 and 100 examples
respectively) while being faster to train. This high performance came at the expense of the
original commands, however, on which accuracy deteriorated. These results are very
promising, as the methods allow an existing S2S model to be easily extended with
minimal resources.
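As a sketch of the kind of attention-based model used for command classification, the following compresses the encoder-decoder into a recurrent encoder with attention pooling over a closed command set; the layer sizes and the pooling simplification are assumptions, not the thesis architecture:

    import torch
    import torch.nn as nn

    class AttentionCommandClassifier(nn.Module):
        """Sketch: recurrent encoder over acoustic features, additive
        attention pooling, and a softmax over command labels."""
        def __init__(self, n_feats=40, hidden=128, n_commands=30):
            super().__init__()
            self.encoder = nn.GRU(n_feats, hidden,
                                  batch_first=True, bidirectional=True)
            self.att = nn.Linear(2 * hidden, 1)       # attention scores
            self.out = nn.Linear(2 * hidden, n_commands)

        def forward(self, feats):                     # (batch, time, n_feats)
            enc, _ = self.encoder(feats)              # (batch, time, 2*hidden)
            w = torch.softmax(self.att(enc), dim=1)   # weights over time
            ctx = (w * enc).sum(dim=1)                # attention-weighted summary
            return self.out(ctx)                      # command logits

    # logits = AttentionCommandClassifier()(torch.randn(4, 200, 40))  # (4, 30)

Extending such a model to new commands amounts to widening (or adapting) the output layer, which is why the few-shot strategies above trade accuracy on the original commands against training speed.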
Finally, a full spoken command recognition system (named iCubrec) has been developed
for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to
offer a fully hands-free experience. By segmenting only regions that are likely to contain
commands, the VAD module also greatly reduces the computational cost of the
pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM
command recognition system for transcription. The VoCub dataset has been specifically
gathered to train a DNN-based acoustic model for our task. Through multi-condition training
with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler
model, complemented by a rejection mechanism based on a confidence score, is finally added
to the system to reject non-command speech in a live demonstration of the system.
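The acceptance logic of such a pipeline can be sketched as follows; `recognizer.decode` and the `"<filler>"` label are hypothetical placeholders for the DNN-HMM decoder and the filler model, and the threshold is an assumed value:

    def recognize_stream(segments, recognizer, conf_threshold=0.7):
        """Sketch of the rejection logic: VAD-selected segments are decoded,
        and a hypothesis is kept only when it is not the filler and its
        confidence clears the threshold; everything else is rejected as
        non-command speech."""
        for seg in segments:              # segments produced by the VAD module
            command, confidence = recognizer.decode(seg)
            if command != "<filler>" and confidence >= conf_threshold:
                yield command             # accept
            # else: drop silently as non-command speech

Gating the decoder behind the VAD in this way is also what keeps the computational cost low: the expensive recognizer only runs on speech-like segments.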
Efficient Keyword Spotting by capturing long-range interactions with Temporal Lambda Networks
Models based on attention mechanisms have shown unprecedented speech
recognition performance. However, they are computationally expensive and
unnecessarily complex for keyword spotting, a task targeted at small-footprint
devices. This work explores the application of Lambda networks, an alternative
framework for capturing long-range interactions without attention, to the
keyword spotting task. We propose a novel ResNet-based model in which the
residual blocks are replaced by temporal Lambda layers. Furthermore, the
proposed architecture is built upon one-dimensional temporal convolutions that
further reduce its complexity. The presented model not only reaches
state-of-the-art accuracy on the Google Speech Commands dataset, but is also
85% and 65% lighter than its Transformer-based (KWT) and convolutional (Res15)
counterparts while being up to 100 times faster. To the best of our knowledge,
this is the first attempt to explore the Lambda framework within the speech
domain, and we therefore hope it opens up further research into new interfaces
based on this architecture.
Comment: speech recognition, keyword spotting, lambda networks
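For readers unfamiliar with lambda layers, the following is a minimal content-only temporal lambda layer in PyTorch; Bello's full formulation also includes position lambdas, which are omitted here, and the layer sizes are assumptions:

    import torch
    import torch.nn as nn

    class TemporalLambdaLayer(nn.Module):
        """Content-only 1-D lambda layer. Input and output are
        (batch, channels, time), so the layer can stand in for a
        convolution inside a ResNet residual block."""
        def __init__(self, channels, k_dim=16):
            super().__init__()
            self.q = nn.Conv1d(channels, k_dim, 1)
            self.k = nn.Conv1d(channels, k_dim, 1)
            self.v = nn.Conv1d(channels, channels, 1)

        def forward(self, x):                            # x: (b, c, t)
            q = self.q(x)                                # (b, k, t)
            k = torch.softmax(self.k(x), dim=-1)         # normalize over time
            v = self.v(x)                                # (b, c, t)
            lam = torch.einsum('bkt,bct->bkc', k, v)     # content lambda
            return torch.einsum('bkt,bkc->bct', q, lam)  # apply to queries

    # y = TemporalLambdaLayer(64)(torch.randn(2, 64, 100))  # same shape out

Unlike attention, the summary `lam` is shared by all positions and has size k_dim x channels regardless of sequence length, which is where the speed and footprint advantages over Transformer-based models come from.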