Efficient keyword spotting using dilated convolutions and gating
We explore the application of end-to-end stateless temporal modeling to
small-footprint keyword spotting as opposed to recurrent networks that model
long-term temporal dependencies using internal states. We propose a model
inspired by the recent success of dilated convolutions in sequence modeling
applications, which allows training deeper architectures in resource-constrained
configurations. Gated activations and residual connections are also added,
following a similar configuration to WaveNet. In addition, we apply a custom
target labeling that back-propagates loss from specific frames of interest,
therefore yielding higher accuracy and requiring only detection of the end of the
keyword. Our experimental results show that our model outperforms a max-pooling
loss trained recurrent neural network using LSTM cells, with a significant
decrease in false rejection rate. The underlying dataset - "Hey Snips"
utterances recorded by over 2.2K different speakers - has been made publicly
available to establish an open reference for wake-word detection.
Comment: Accepted for publication to ICASSP 201
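The gated dilated convolutions described above can be illustrated with a minimal sketch. The weights, the dilation value, and the single-channel setting below are hypothetical simplifications for illustration, not the paper's configuration:

```python
import math

def dilated_causal_conv(x, weights, dilation):
    """1-D causal convolution with the given dilation (implicit left zero-padding)."""
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - (k - 1 - i) * dilation  # older taps reach further back
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

def gated_block(x, w_filter, w_gate, dilation):
    """WaveNet-style gated activation tanh(filter) * sigmoid(gate) with a residual add."""
    f = dilated_causal_conv(x, w_filter, dilation)
    g = dilated_causal_conv(x, w_gate, dilation)
    z = [math.tanh(fi) * (1.0 / (1.0 + math.exp(-gi))) for fi, gi in zip(f, g)]
    return [xi + zi for xi, zi in zip(x, z)]  # residual connection
```

Causality is what makes the model streamable: the output at frame t never depends on frames after t.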
Domain Aware Training for Far-field Small-footprint Keyword Spotting
In this paper, we focus on the task of small-footprint keyword spotting under
the far-field scenario. Far-field environments are commonly encountered in
real-life speech applications, causing severe degradation of performance due to
room reverberation and various kinds of noises. Our baseline system is built on
the convolutional neural network trained with pooled data of both far-field and
close-talking speech. To cope with the distortions, we develop three domain
aware training systems, including the domain embedding system, the deep CORAL
system, and the multi-task learning system. These methods incorporate domain
knowledge into network training and improve the performance of the keyword
classifier on far-field conditions. Experimental results show that our proposed
methods manage to maintain the performance on the close-talking speech and
achieve significant improvement on the far-field test set.
Comment: Submitted to INTERSPEECH 202
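The deep CORAL system mentioned above aligns second-order statistics of source and target features. A minimal sketch of that loss, using tiny plain-Python matrices and hypothetical feature batches, might look like:

```python
def covariance(batch):
    """Feature covariance of a batch (list of equal-length feature vectors)."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in batch:
        c = [row[j] - mean[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (n - 1)
    return cov

def coral_loss(source, target):
    """Squared Frobenius distance between source/target covariances, scaled by 1/(4 d^2)."""
    cs, ct = covariance(source), covariance(target)
    d = len(cs)
    return sum((cs[i][j] - ct[i][j]) ** 2
               for i in range(d) for j in range(d)) / (4 * d * d)
```

In training, this term would be added to the keyword-classification loss so far-field and close-talking feature distributions are pulled together.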
End-to-End Multi-Look Keyword Spotting
The performance of keyword spotting (KWS), measured in false alarms and false
rejects, degrades significantly under the far field and noisy conditions. In
this paper, we propose a multi-look neural network modeling for speech
enhancement which simultaneously steers to listen to multiple sampled look
directions. The multi-look enhancement is then jointly trained with KWS to form
an end-to-end KWS model which integrates the enhanced signals from multiple
look directions and leverages an attention mechanism to dynamically tune the
model's attention to the reliable sources. We demonstrate, on our large noisy
and far-field evaluation sets, that the proposed approach significantly
improves the KWS performance against the baseline KWS system and a recent
beamformer-based multi-beam KWS system.
Comment: Submitted to Interspeech202
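The attention mechanism that tunes the model toward reliable sources can be sketched as softmax pooling over per-look-direction feature vectors. The dimensions and scores below are hypothetical:

```python
import math

def attention_pool(look_features, scores):
    """Softmax-weighted sum of per-look-direction feature vectors."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    d = len(look_features[0])
    pooled = [sum(w * f[j] for w, f in zip(weights, look_features))
              for j in range(d)]
    return pooled, weights
```

A direction whose score is much larger than the others dominates the pooled representation, which is the intended "steering" behavior.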
End-to-End Streaming Keyword Spotting
We present a system for keyword spotting that, except for a frontend
component for feature generation, is entirely contained in a deep neural
network (DNN) model trained "end-to-end" to predict the presence of the keyword
in a stream of audio. The main contributions of this work are, first, an
efficient memoized neural network topology that aims at making better use of
the parameters and associated computations in the DNN by holding a memory of
previous activations distributed over the depth of the DNN. The second
contribution is a method to train the DNN, end-to-end, to produce the keyword
spotting score. This system significantly outperforms previous approaches both
in terms of quality of detection as well as size and computation.
Comment: Accepted in International Conference on Acoustics, Speech, and Signal Processing 201
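The memoized topology above holds a memory of previous activations so that each incoming frame triggers only incremental computation. A toy single-neuron sketch (the weights and memory size are hypothetical, not the paper's architecture):

```python
from collections import deque

class StreamingLayer:
    """Caches the last `memory` activations so each new frame reuses past computation."""
    def __init__(self, weights, memory):
        self.weights = weights
        self.buffer = deque([0.0] * memory, maxlen=memory)

    def step(self, frame):
        # Push the newest activation; the oldest one falls out of the ring buffer.
        self.buffer.append(frame)
        # One dot product per incoming frame instead of recomputing the whole window.
        return sum(w * a for w, a in zip(self.weights, self.buffer))
```

Stacking such layers gives the "memory distributed over the depth of the DNN" flavor: each level caches its own recent outputs for the level above.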
Sequence-to-sequence Models for Small-Footprint Keyword Spotting
In this paper, we propose a sequence-to-sequence model for keyword spotting
(KWS). Compared with other end-to-end architectures for KWS, our model
simplifies the pipeline of a production-quality KWS system and satisfies the
requirements of high accuracy, low latency, and a small footprint. We also
evaluate the performance of different encoder architectures, including LSTM
and GRU. Experiments on real-world wake-up data show that our approach
outperforms the recently proposed attention-based end-to-end model.
Specifically, with 73K parameters, our sequence-to-sequence model
achieves a 3.05% false rejection rate (FRR) at 0.1 false alarms (FA) per hour.
Comment: Submitted to ICASSP 201
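Operating points such as "3.05% FRR at 0.1 FA per hour" are read off by picking the detection threshold from negative-data scores. A minimal sketch of that computation, with toy scores and a hypothetical helper name:

```python
def frr_at_fa_rate(pos_scores, neg_scores, neg_hours, fa_per_hour):
    """False rejection rate at the threshold allowing at most fa_per_hour false alarms."""
    allowed = int(fa_per_hour * neg_hours)     # false-alarm budget over the negative data
    ranked = sorted(neg_scores, reverse=True)
    # Accept a score only if it is strictly above the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    rejects = sum(1 for s in pos_scores if s <= threshold)
    return rejects / len(pos_scores)
```

Sweeping fa_per_hour traces out the detection-error tradeoff curve that KWS papers typically report.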
Federated Learning for Keyword Spotting
We propose a practical approach based on federated learning to solve
out-of-domain issues with continuously running embedded speech-based models
such as wake word detectors. We conduct an extensive empirical study of the
federated averaging algorithm for the "Hey Snips" wake word based on a
crowdsourced dataset that mimics a federation of wake word users. We
empirically demonstrate that using an adaptive averaging strategy inspired by
Adam in place of standard weighted model averaging substantially reduces the
number of communication rounds required to reach our target performance. The
associated upstream communication cost per user is estimated at 8 MB, which is
reasonable in the context of smart home voice assistants. Additionally, the
dataset used for these experiments is being open sourced with the aim of
fostering further transparent research in the application of federated learning
to speech data.
Comment: Accepted for publication to ICASSP 201
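The Adam-inspired server-side strategy described above can be sketched as applying an Adam-style update to the mean client delta instead of plain weighted averaging. The hyperparameters and the scalar-weight setting are hypothetical:

```python
import math

class AdamServer:
    """Server-side Adam-style update applied to the averaged client delta."""
    def __init__(self, dim, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = [0.0] * dim   # first-moment estimate
        self.v = [0.0] * dim   # second-moment estimate
        self.t = 0

    def update(self, weights, client_deltas):
        # Average the client updates (plain federated averaging would stop here).
        n = len(client_deltas)
        avg = [sum(d[j] for d in client_deltas) / n for j in range(len(weights))]
        self.t += 1
        new_w = []
        for j, g in enumerate(avg):
            self.m[j] = self.b1 * self.m[j] + (1 - self.b1) * g
            self.v[j] = self.b2 * self.v[j] + (1 - self.b2) * g * g
            m_hat = self.m[j] / (1 - self.b1 ** self.t)   # bias correction
            v_hat = self.v[j] / (1 - self.b2 ** self.t)
            new_w.append(weights[j] + self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_w
```

The per-coordinate scaling adapts the effective server learning rate, which is one plausible reason fewer communication rounds are needed than with uniform averaging.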
JavaScript Convolutional Neural Networks for Keyword Spotting in the Browser: An Experimental Analysis
Used for simple commands recognition on devices from smart routers to mobile
phones, keyword spotting systems are everywhere. Ubiquitous as well are web
applications, which have grown in popularity and complexity over the last
decade with significant improvements in usability under cross-platform
conditions. However, despite their obvious advantage in natural language
interaction, voice-enabled web applications are still few and far between. In
this work, we attempt to bridge this gap by bringing keyword spotting
capabilities directly into the browser. To our knowledge, we are the first to
demonstrate a fully-functional implementation of convolutional neural networks
in pure JavaScript that runs in any standards-compliant browser. We also apply
network slimming, a model compression technique, to explore the
accuracy-efficiency tradeoffs, reporting latency measurements on a range of
devices and software. Overall, our robust, cross-device implementation for
keyword spotting realizes a new paradigm for serving neural network
applications, and one of our slim models reduces latency by 66% with a minimal
decrease in accuracy of 4%, from 94% to 90%.
Comment: 5 pages, 3 figure
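Network slimming prunes channels whose batch-normalization scale factors are small. A minimal sketch of the channel-selection step (the gamma values and keep ratio below are hypothetical):

```python
def slim_channels(gamma, keep_ratio):
    """Return the indices of the channels with the largest |gamma| scale factors."""
    k = max(1, int(len(gamma) * keep_ratio))
    ranked = sorted(range(len(gamma)), key=lambda i: abs(gamma[i]), reverse=True)
    return sorted(ranked[:k])  # keep original channel order for slicing weights
```

The surviving indices are then used to slice the convolution weights, shrinking both the model and its inference latency.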
End-to-end Models with auditory attention in Multi-channel Keyword Spotting
In this paper, we propose an attention-based end-to-end model for
multi-channel keyword spotting (KWS), which is trained to optimize the KWS
result directly. Consequently, our model outperforms the baseline model with
signal pre-processing techniques on both clean and noisy test data. We
also found that multi-task learning results in a better performance when the
training and testing data are similar. Transfer learning and multi-target
spectral mapping can dramatically enhance the robustness to the noisy
environment. At 0.1 false alarms (FA) per hour, the model with transfer
learning and multi-target mapping gains an absolute 30% improvement in the
wake-up rate on noisy data with an SNR of about -20.
Comment: Submitted to ICASSP 201
Learning acoustic word embeddings with phonetically associated triplet network
Previous research on acoustic word embeddings used in query-by-example
spoken term detection has shown remarkable performance improvements when using
a triplet network. However, the triplet network is trained using only limited
information about acoustic similarity between words. In this paper, we propose
a novel architecture, phonetically associated triplet network (PATN), which
aims at increasing discriminative power of acoustic word embeddings by
utilizing phonetic information as well as word identity. The proposed model is
trained to minimize a combined loss function constructed by adding a
cross-entropy loss to the lower layer of the LSTM-based triplet network. We
observed that the proposed method performs significantly better than the
baseline triplet network on a word discrimination task with the WSJ dataset
resulting in over 20% relative improvement in recall rate at 1.0 false alarm
per hour. Finally, we examined the generalization ability by conducting the
out-of-domain test on the RM dataset.
Comment: 5 pages, 4 figures, submitted to ICASSP 201
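The combined objective above pairs a triplet loss on embeddings with a cross-entropy term. A minimal sketch of such a combined loss, with toy vectors and a hypothetical weighting alpha (not the paper's exact formulation):

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared-distance gap: pull positive close, push negative past the margin."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably in log space."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def combined_loss(anchor, positive, negative, phone_logits, phone_label, alpha=0.5):
    """Triplet loss on embeddings plus a weighted phone-classification cross entropy."""
    return (triplet_loss(anchor, positive, negative)
            + alpha * cross_entropy(phone_logits, phone_label))
```

The auxiliary cross-entropy gives the lower layers a phonetic training signal beyond the word-identity information carried by the triplets.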
Data Augmentation for Robust Keyword Spotting under Playback Interference
Accurate on-device keyword spotting (KWS) with low false accept and false
reject rate is crucial to customer experience for far-field voice control of
conversational agents. It is particularly challenging to maintain low false
reject rate in real world conditions where there is (a) ambient noise from
external sources such as TV, household appliances, or other speech that is not
directed at the device, and (b) imperfect cancellation of the audio playback from
the device, resulting in residual echo, after being processed by the Acoustic
Echo Cancellation (AEC) system. In this paper, we propose a data augmentation
strategy to improve keyword spotting performance under these challenging
conditions. The training set audio is artificially corrupted by mixing in music
and TV/movie audio, at different signal-to-interference ratios. Our results
show that we obtain around a 30-45% relative reduction in false reject rates, at a
range of false alarm rates, under audio playback from such devices
- …
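The augmentation strategy above mixes interference at a chosen signal-to-interference ratio (SIR). A minimal sketch of scaling interference to hit a target SIR, using toy mono samples and a hypothetical helper name:

```python
import math

def mix_at_sir(speech, interference, sir_db):
    """Scale the interference so the mixture has the requested signal-to-interference ratio."""
    power = lambda x: sum(s * s for s in x) / len(x)
    target_ratio = 10 ** (sir_db / 10.0)           # dB -> linear power ratio
    scale = math.sqrt(power(speech) / (power(interference) * target_ratio))
    return [s + scale * n for s, n in zip(speech, interference)]
```

Drawing sir_db from a range during training exposes the keyword model to varying amounts of music and TV/movie interference.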