Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting that has so far received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Unlike prior work on KWS,
which tries to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods.
Comment: Accepted at ECCV-201
Seeing wake words: Audio-visual Keyword Spotting
The goal of this work is to automatically determine whether and when a word
of interest is spoken by a talking face, with or without the audio. We propose
a zero-shot method suitable for in-the-wild videos. Our key contributions are:
(1) a novel convolutional architecture, KWS-Net, that uses a similarity map
intermediate representation to separate the task into (i) sequence matching,
and (ii) pattern detection, to decide whether the word is there and when; (2)
we demonstrate that, if audio is available, visual keyword spotting improves
performance for both clean and noisy audio signals. Finally, (3) we show that
our method generalises to other languages, specifically French and German, and
achieves performance comparable to English with less language-specific data,
by fine-tuning the network pre-trained on English. The method exceeds the
performance of the previous state-of-the-art visual keyword spotting
architecture when trained and tested on the same benchmark, and also that of a
state-of-the-art lip reading method.
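The similarity-map idea in the abstract above can be illustrated with a toy sketch. It is not the KWS-Net implementation: the embeddings here are hand-written 2-D vectors, cosine similarity is an assumed choice of score, and `similarity_map` is a hypothetical helper name. In the paper's setting the features would be learned and a convolutional pattern detector would read the map.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_map(keyword_embeddings, frame_features):
    """Return a matrix with one row per keyword unit and one column per video frame."""
    return [[cosine(k, f) for f in frame_features] for k in keyword_embeddings]


# Toy data (assumption): a two-unit keyword and four video frames.
keyword = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sim = similarity_map(keyword, frames)
for row in sim:
    print([round(x, 2) for x in row])
```

In this toy map the keyword shows up as a diagonal run of high scores (unit 0 at frame 1, unit 1 at frame 2), which is the kind of pattern a detector could look for to decide whether and when the word occurs.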
Very Fast Keyword Spotting System with Real Time Factor below 0.01
In this paper, we present an architecture of a keyword spotting (KWS) system
that is based on modern neural networks, yields good performance on various
types of speech data and can run very fast. We focus mainly on the last aspect
and propose optimizations for all the steps required in a KWS design: signal
processing and likelihood computation, Viterbi decoding, spot candidate
detection and confidence calculation. We present time- and memory-efficient
modelling by bidirectional feedforward sequential memory networks (an
alternative to recurrent nets), using either standard triphones or so-called
quasi-monophones, and an entirely forward decoding of speech frames (with
minimal need for look-back). Several variants of the proposed scheme are
evaluated on 3 large Czech datasets (broadcast, internet and telephone, 17
hours in total) and their performance is compared by Detection Error Tradeoff
(DET) diagrams and real-time (RT) factors. We demonstrate that the complete
system can run in a single pass with a RT factor close to 0.001 if all
optimizations (including a GPU for likelihood computation) are applied.
Comment: 11 pages, 3 figures
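For readers unfamiliar with the real-time (RT) factor used above, the following minimal sketch works through the arithmetic, taking the RT factor as processing time divided by audio duration. The function names are hypothetical, and applying the quoted factor to the full 17 hours in one run is an illustrative assumption, not a result reported in the abstract.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RT factor = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds: float, rt_factor: float) -> float:
    """Wall-clock time needed to process audio at a given RT factor."""
    return audio_seconds * rt_factor


# 17 hours of audio, the total dataset size quoted in the abstract.
total_audio_seconds = 17 * 3600

# Illustrative assumption: the whole collection is processed at RT factor 0.001.
seconds_needed = processing_time(total_audio_seconds, 0.001)
print(f"{seconds_needed:.1f} s")
```

Under these assumptions, processing the full 17 hours would take roughly one minute, which gives a sense of scale for an RT factor close to 0.001.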
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview is comprehensive,
covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.