229 research outputs found
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
We propose a max-pooling based loss function for training Long Short-Term
Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low
CPU, memory, and latency requirements. The max-pooling loss training can be
further guided by initializing with a cross-entropy loss trained network. A
posterior smoothing based evaluation approach is employed to measure keyword
spotting performance. Our experimental results show that LSTM models trained
using cross-entropy loss or max-pooling loss outperform a cross-entropy loss
trained baseline feed-forward Deep Neural Network (DNN). In addition,
max-pooling loss trained LSTM with randomly initialized network performs better
compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss
trained LSTM initialized with a cross-entropy pre-trained network shows the
best performance, which yields relative reduction compared to baseline
feed-forward DNN in Area Under the Curve (AUC) measure
Attention-Based Models for Text-Dependent Speaker Verification
Attention-based models have recently shown great performance on a range of
tasks, such as speech recognition, machine translation, and image captioning
due to their ability to summarize relevant information that expands through the
entire length of an input sequence. In this paper, we analyze the usage of
attention mechanisms to the problem of sequence summarization in our end-to-end
text-dependent speaker recognition system. We explore different topologies and
their variants of the attention layer, and compare different pooling methods on
the attention weights. Ultimately, we show that attention-based models can
improves the Equal Error Rate (EER) of our speaker verification system by
relatively 14% compared to our non-attention LSTM baseline model.Comment: Submitted to ICASSP 201
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which so far has received no attention by the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Different to prior works on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods.Comment: Accepted at ECCV-201
Deep Spoken Keyword Spotting:An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS
- …