Wake Word Detection Based on Res2Net
This letter proposes a new wake word detection system based on Res2Net. As a
variant of ResNet, Res2Net was first applied to object detection. Res2Net
realizes multiple feature scales by increasing possible receptive fields. This
multiple scaling mechanism significantly improves the detection ability of wake
words with different durations. Compared with the ResNet-based model, Res2Net
also significantly reduces the model size and is more suitable for detecting
wake words. The proposed system can determine the positions of wake words from
the audio stream without any additional assistance. The proposed method is
verified on the Mobvoi dataset containing two wake words. At a false alarm rate
of 0.5 per hour, the system reduced the false rejection of the two wake words
by more than 12% over prior works.
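The operating point quoted above combines two metrics: false alarms per hour and the false rejection rate. A minimal sketch of how these are commonly computed is given here. The function names and the example counts are illustrative assumptions and are not taken from the letter.

```python
def false_alarms_per_hour(num_false_alarms, negative_audio_seconds):
    """False triggers normalized by the duration of audio without the wake word."""
    if negative_audio_seconds <= 0:
        raise ValueError("negative_audio_seconds must be positive")
    return num_false_alarms * 3600.0 / negative_audio_seconds


def false_rejection_rate(num_missed, num_positive):
    """Fraction of wake word utterances on which the detector did not trigger."""
    if num_positive <= 0:
        raise ValueError("num_positive must be positive")
    return num_missed / num_positive


if __name__ == "__main__":
    # Hypothetical counts: 5 false triggers in 10 hours of negative audio,
    # 3 missed wake words out of 200 positive utterances.
    print(false_alarms_per_hour(5, 10 * 3600))
    print(false_rejection_rate(3, 200))
```

In practice the detection threshold is tuned until the false alarm rate reaches the target (here 0.5 per hour), and the false rejection rate is then read off at that threshold.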
Keyword Spotting System and Evaluation of Pruning and Quantization Methods on Low-power Edge Microcontrollers
Keyword spotting (KWS) is beneficial for voice-based user interactions with
low-power devices at the edge. The edge devices are usually always-on, so edge
computing brings bandwidth savings and privacy protection. The devices,
for example Cortex-M based microcontrollers, typically have limited memory,
computational performance, power, and cost. The challenge is to meet
the high computation and low-latency requirements of deep learning on these
devices. This paper first presents our small-footprint KWS system running on an
STM32F7 microcontroller with a Cortex-M7 core @216MHz and 512KB static RAM. Our
selected convolutional neural network (CNN) architecture has a reduced number
of operations for KWS to meet the constraints of edge devices. Our baseline
system generates a classification result every 37ms, including the real-time audio
feature extraction. This paper further evaluates the actual performance
of different pruning and quantization methods on the microcontroller, including
different granularities of sparsity, skipping zero weights, weight-prioritized
loop order, and SIMD instructions. The results show that for microcontrollers,
accelerating unstructured pruned models poses considerable challenges,
and structured pruning is friendlier than unstructured pruning. The
results also verify the performance improvement from quantization and SIMD
instructions. Comment: Submitted to DCASE2022 Workshop. Code available at:
https://github.com/RoboBachelor/Keyword-Spotting-STM3
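The difference between unstructured and structured pruning discussed in this abstract can be illustrated with a small sketch. It assumes magnitude-based pruning of a weight matrix stored as a list of rows, with each row standing in for one output channel. These choices and the function names are assumptions made for illustration and are not the paper's implementation.

```python
def prune_unstructured(weights, sparsity):
    """Zero the individual weights with the smallest magnitudes."""
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    num_to_zero = int(len(ranked) * sparsity)
    pruned = [list(row) for row in weights]
    for _, r, c in ranked[:num_to_zero]:
        pruned[r][c] = 0.0
    return pruned


def prune_structured(weights, sparsity):
    """Zero whole rows (e.g. output channels) with the smallest L1 norms."""
    ranked = sorted((sum(abs(w) for w in row), r) for r, row in enumerate(weights))
    num_to_zero = int(len(weights) * sparsity)
    dropped = {r for _, r in ranked[:num_to_zero]}
    return [
        [0.0] * len(row) if r in dropped else list(row)
        for r, row in enumerate(weights)
    ]


if __name__ == "__main__":
    # Hypothetical 4x3 weight matrix.
    w = [
        [0.9, -0.1, 0.4],
        [0.05, 0.02, -0.03],
        [-0.7, 0.6, 0.2],
        [0.3, -0.8, 0.01],
    ]
    print(prune_unstructured(w, 0.5))  # zeros scattered across rows
    print(prune_structured(w, 0.5))    # two entire rows zeroed
```

With structured pruning a zeroed row can be skipped as a unit in the inference loop, whereas scattered zeros need per-weight checks or index bookkeeping, which is one plausible reason the structured variant is easier to accelerate on a microcontroller.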
Spoken command recognition for robotics
In this thesis, I investigate spoken command recognition technology for robotics. While high
robustness is expected, the distant and noisy conditions in which the system has to operate
make the task very challenging. Unlike commercial systems which all rely on a "wake-up"
word to initiate the interaction, the pipeline proposed here directly detects and recognizes
commands from the continuous audio stream. In order to keep the task manageable despite
low-resource conditions, I propose to focus on a limited set of commands, thus trading off
flexibility of the system against robustness.
Domain and speaker adaptation strategies based on a multi-task regularization paradigm
are first explored. More precisely, two different methods are proposed that rely on a tied
loss function penalizing the distance between the outputs of several networks. The first
method considers each speaker or domain as a task. A canonical task-independent network is
jointly trained with task-dependent models, allowing both types of networks to improve by
learning from one another. While an improvement of 3.2% in the frame error rate (FER) of
the task-independent network was obtained, this only partially carried over to the phone error
rate (PER), which improved by 1.5%. Similarly, a second method explored the parallel
training of the canonical network with a privileged model having access to i-vectors. This
method proved less effective, with an improvement of only 1.2% in the FER.
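The tied loss described in this paragraph can be sketched as follows. The abstract states only that the loss penalizes the distance between the outputs of several networks. The squared Euclidean distance, the `tie_weight` parameter, and the two-network setup shown here are assumptions made for illustration.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def tied_loss(probs_canonical, probs_task, target, tie_weight=0.1):
    """Sum of both networks' classification losses plus a penalty that
    pulls their output distributions towards each other."""
    distance = sum((a - b) ** 2 for a, b in zip(probs_canonical, probs_task))
    return (
        cross_entropy(probs_canonical, target)
        + cross_entropy(probs_task, target)
        + tie_weight * distance
    )


if __name__ == "__main__":
    # Hypothetical posteriors over three classes for one frame.
    canonical = [0.5, 0.3, 0.2]
    task_dependent = [0.5, 0.2, 0.3]
    print(tied_loss(canonical, task_dependent, target=0, tie_weight=1.0))
```

When the two networks agree the penalty vanishes and the loss reduces to the sum of the two classification losses, so the tying term only acts where the task-independent and task-dependent models diverge.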
In order to make the developed technology more accessible, I also investigated the use
of a sequence-to-sequence (S2S) architecture for command classification. The use of an
attention-based encoder-decoder model reduced the classification error by 40% relative to a
strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing
the relevance of S2S architectures in such a context. In order to improve the flexibility of the
trained system, I also explored strategies for few-shot learning, which allow the
set of commands to be extended with minimal data requirements. Retraining a model on the
combination of original and new commands, I achieved 40.5% accuracy on the
new commands with only 10 examples for each of them. This score rises to 81.5%
accuracy with a larger set of 100 examples per new command. An alternative strategy, based
on model adaptation, achieved even better scores, with 68.8% and 88.4% accuracy with 10
and 100 examples respectively, while being faster to train. This high performance is obtained
at the expense of the original categories, though, on which the accuracy deteriorated. These
results are very promising, as the methods make it easy to extend an existing S2S model with
minimal resources.
Finally, a full spoken command recognition system (named iCubrec) has been developed
for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to
offer a fully hands-free experience. By segmenting only regions that are likely to contain
commands, the VAD module also greatly reduces the computational cost of the
pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM
command recognition system for transcription. The VoCub dataset has been specifically
gathered to train a DNN-based acoustic model for our task. Through multi-condition training
with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler model,
complemented by a rejection mechanism based on a confidence score, is finally added to the
system to reject non-command speech in a live demonstration of the system.
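The rejection step mentioned at the end of this abstract can be illustrated with a minimal sketch. It assumes the recognizer returns a label and a confidence score in the range 0 to 1. The function name, the `<filler>` label, and the threshold value are hypothetical and are not taken from the thesis.

```python
def accept_command(hypothesis, confidence, threshold=0.7, filler_label="<filler>"):
    """Return the recognized command if it should be acted on, otherwise None."""
    if hypothesis == filler_label:
        # The filler model absorbed the segment: treat it as non-command speech.
        return None
    if confidence < threshold:
        # A command was hypothesized, but with too little confidence.
        return None
    return hypothesis


if __name__ == "__main__":
    # Hypothetical recognizer outputs.
    print(accept_command("go forward", 0.92))
    print(accept_command("go forward", 0.40))
    print(accept_command("<filler>", 0.99))
```

Raising the threshold rejects more non-command speech at the cost of also rejecting more genuine commands, so the value would normally be tuned on held-out data.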