Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which has so far received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. In contrast to prior works on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods.
Comment: Accepted at ECCV-2018
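To make the three-component design concrete, here is a minimal PyTorch sketch of the overall wiring. All module names and sizes are illustrative assumptions, not the paper's: the actual system uses a much deeper spatiotemporal ResNet, and the full grapheme-to-phoneme seq2seq model is omitted here (its phoneme output is taken as given).

```python
import torch
import torch.nn as nn

class ZeroShotVisualKWS(nn.Module):
    """Toy sketch of the three-part architecture described above."""
    def __init__(self, feat_dim=256, phone_vocab=50, hidden=256):
        super().__init__()
        # (a) stand-in for the spatiotemporal ResNet front-end: maps a clip
        # (B, C, T, H, W) to per-frame features (B, T, feat_dim)
        self.visual = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=(5, 7, 7),
                      stride=(1, 4, 4), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # (b) keyword encoder over the phoneme sequence that a G2P model
        # (not shown) produced for the query word
        self.phone_emb = nn.Embedding(phone_vocab, hidden)
        self.keyword_rnn = nn.GRU(hidden, hidden, batch_first=True)
        # (c) recurrent stack correlating visual features with the keyword code
        self.corr_rnn = nn.GRU(feat_dim + hidden, hidden,
                               num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # keyword present / absent

    def forward(self, video, phonemes):
        f = self.visual(video).squeeze(-1).squeeze(-1).transpose(1, 2)
        _, kw = self.keyword_rnn(self.phone_emb(phonemes))
        kw = kw[-1].unsqueeze(1).expand(-1, f.size(1), -1)  # tile over time
        out, _ = self.corr_rnn(torch.cat([f, kw], dim=-1))
        return self.head(out[:, -1])  # logit for "query occurs in clip"
```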
Seeing wake words: Audio-visual Keyword Spotting
The goal of this work is to automatically determine whether and when a word
of interest is spoken by a talking face, with or without the audio. We propose
a zero-shot method suitable for in-the-wild videos. Our key contributions are:
(1) a novel convolutional architecture, KWS-Net, that uses a similarity map
intermediate representation to separate the task into (i) sequence matching,
and (ii) pattern detection, to decide whether the word is there and when; (2)
we demonstrate that if audio is available, visual keyword spotting improves the
performance both for a clean and noisy audio signal. Finally, (3) we show that
our method generalises to other languages, specifically French and German, and
achieves a comparable performance to English with less language specific data,
by fine-tuning the network pre-trained on English. The method exceeds the
performance of the previous state-of-the-art visual keyword spotting
architecture when trained and tested on the same benchmark, and also that of a
state-of-the-art lip reading method.
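The similarity-map idea can be illustrated with a short PyTorch sketch: keyword and video features are compared phoneme-by-frame, and a small CNN scans the resulting map for the roughly diagonal stripe a spoken word produces. This is a toy stand-in, not KWS-Net itself; all dimensions are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def similarity_map(video_feats, keyword_feats):
    """Cosine-similarity map between per-frame visual features (B, Tv, D)
    and per-phoneme keyword features (B, Tk, D) -> (B, 1, Tk, Tv)."""
    v = F.normalize(video_feats, dim=-1)
    k = F.normalize(keyword_feats, dim=-1)
    return torch.einsum('bkd,bvd->bkv', k, v).unsqueeze(1)

# A small CNN scans the map, separating sequence matching (the map itself)
# from pattern detection (finding the stripe).
detector = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

sim = similarity_map(torch.randn(2, 100, 64), torch.randn(2, 7, 64))
per_frame_logits = detector(sim).amax(dim=2)  # (B, 1, Tv): "when" scores
clip_logit = per_frame_logits.amax(dim=-1)    # (B, 1): "whether" score
```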
What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision
We present a novel method for aligning a sequence of instructions to a video
of someone carrying out a task. In particular, we focus on the cooking domain,
where the instructions correspond to the recipe. Our technique relies on an HMM
to align the recipe steps to the (automatically generated) speech transcript.
We then refine this alignment using a state-of-the-art visual food detector,
based on a deep convolutional neural network. We show that our technique
outperforms simpler techniques based on keyword spotting. It also enables
interesting applications, such as automatically illustrating recipes with
keyframes, and searching within a video for events of interest.
Comment: To appear in NAACL 2015
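A minimal sketch of the alignment step, assuming a toy emission score in place of the paper's HMM observation model: recipe steps are matched monotonically to transcript chunks with Viterbi-style dynamic programming.

```python
import numpy as np

def align_steps(log_emit):
    """Monotonic Viterbi-style alignment of R recipe steps to T transcript
    chunks. log_emit[r, t] scores how well chunk t matches step r (e.g. from
    word overlap); a toy stand-in for the paper's HMM emission model."""
    R, T = log_emit.shape
    dp = np.full((R, T), -np.inf)
    back = np.zeros((R, T), dtype=int)  # 0 = stay on step, 1 = advance
    dp[0, :] = np.cumsum(log_emit[0, :])
    for r in range(1, R):
        for t in range(r, T):
            stay, adv = dp[r, t - 1], dp[r - 1, t - 1]
            back[r, t] = int(adv > stay)
            dp[r, t] = max(stay, adv) + log_emit[r, t]
    # Backtrace: which step each transcript chunk belongs to.
    path, r = [], R - 1
    for t in range(T - 1, -1, -1):
        path.append(r)
        if back[r, t] and r > 0:
            r -= 1
    return path[::-1]

# Toy example: 3 recipe steps, 6 transcript chunks.
scores = np.log(np.array([[.8, .7, .1, .1, .1, .1],
                          [.1, .2, .9, .8, .2, .1],
                          [.1, .1, .1, .2, .7, .9]]))
print(align_steps(scores))  # [0, 0, 1, 1, 2, 2]
```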
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Recent progress in fine-grained gesture and action classification, and in
machine translation, points to the possibility of automated sign language
recognition becoming a reality. A key stumbling block in making progress
towards this goal is a lack of appropriate training data, stemming from the
high complexity of sign annotation and a limited supply of qualified
annotators. In this work, we introduce a new scalable approach to data
collection for sign recognition in continuous videos. We make use of
weakly-aligned subtitles for broadcast footage together with a keyword spotting
method to automatically localise sign-instances for a vocabulary of 1,000 signs
in 1,000 hours of video. We make the following contributions: (1) We show how
to use mouthing cues from signers to obtain high-quality annotations from video
data - the result is the BSL-1K dataset, a collection of British Sign Language
(BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train
strong sign recognition models for co-articulated signs in BSL and that these
models additionally form excellent pretraining for other sign languages and
benchmarks - we exceed the state of the art on both the MSASL and WLASL
benchmarks. Finally, (3) we propose new large-scale evaluation sets for the
tasks of sign recognition and sign spotting and provide baselines which we hope
will serve to stimulate research in this area.
Comment: Appears in: European Conference on Computer Vision 2020 (ECCV 2020). 28 pages
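As a rough illustration of the localisation step (not the paper's actual pipeline), suppose a visual KWS model emits per-frame posteriors for a queried word within a subtitle window; the sign instance can then be placed at the posterior peak.

```python
import numpy as np

def localise_sign(posteriors, fps=25, threshold=0.5):
    """Given per-frame posteriors for a keyword from a (hypothetical) visual
    KWS model run over one subtitle window, return the peak time if it
    clears a confidence threshold, else None.
    `posteriors`: (T,) array of probabilities for the queried word."""
    t = int(np.argmax(posteriors))
    if posteriors[t] < threshold:
        return None  # mouthing evidence too weak; skip this window
    return t / fps  # seconds into the subtitle window

# Toy window: the model fires around frame 40.
p = np.zeros(100)
p[38:43] = [0.2, 0.6, 0.9, 0.7, 0.3]
print(localise_sign(p))  # 1.6 (seconds)
```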
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
Silent speech interfaces are a promising technology enabling private
communication in natural language. However, previous approaches support only a
small and inflexible vocabulary, which leads to limited expressiveness. We
leverage contrastive learning to learn efficient lipreading representations,
enabling few-shot command customization with minimal user effort. Our model
exhibits high robustness to different lighting, posture, and gesture conditions
on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947
is achievable using only one shot, and performance can be further boosted
by adaptively learning from more data. This generalizability allowed us to
develop a mobile silent speech interface empowered with on-device fine-tuning
and visual keyword spotting. A user study demonstrated that with LipLearner,
users could define their own commands with high reliability guaranteed by an
online incremental learning scheme. Subjective feedback indicated that our
system provides essential functionalities for customizable silent speech
interactions with high usability and learnability.
Comment: Conditionally accepted to the ACM CHI Conference on Human Factors in
Computing Systems 2023 (CHI '23)
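Few-shot command customization of this kind is often realised as nearest-centroid classification on top of the contrastive embeddings. The following sketch assumes such embeddings are given; it is not LipLearner's actual inference code.

```python
import numpy as np

def enroll(embeddings, labels):
    """Average the (few) enrollment embeddings per command into centroids.
    `embeddings` come from a lipreading encoder trained contrastively
    (a stand-in for LipLearner's representation model)."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query, centroids):
    """Cosine similarity to each command centroid; highest wins."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos(query, centroids[c]))

# One-shot enrollment of 3 user-defined commands (toy 8-d embeddings).
rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 8))
cents = enroll(embs, np.array([0, 1, 2]))
print(classify(embs[1] + 0.05 * rng.normal(size=8), cents))  # expected: 1
```

New commands are added by enrolling one more centroid, which is what makes incremental, on-device customization cheap.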
Watch, read and lookup: learning to spot signs from multiple supervisors
The focus of this work is sign spotting - given a video of an isolated sign,
our task is to identify whether and where it has been signed in a continuous,
co-articulated sign language video. To achieve this sign spotting task, we
train a model using multiple types of available supervision by: (1) watching
existing sparsely labelled footage; (2) reading associated subtitles (readily
available translations of the signed content) which provide additional
weak-supervision; (3) looking up words (for which no co-articulated labelled
examples are available) in visual sign language dictionaries to enable novel
sign spotting. These three tasks are integrated into a unified learning
framework using the principles of Noise Contrastive Estimation and Multiple
Instance Learning. We validate the effectiveness of our approach on low-shot
sign spotting benchmarks. In addition, we contribute a machine-readable British
Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to
facilitate study of this task. The dataset, models and code are available at
our project page.
Comment: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020) - Oral presentation. 29 pages
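A common way to combine Noise Contrastive Estimation with Multiple Instance Learning is an MIL-NCE-style objective, where each video clip is matched against a bag of candidate positives. The sketch below follows that general formulation under invented shapes; it is not the paper's exact loss.

```python
import torch

def mil_nce(video_emb, text_embs, pos_mask, temperature=0.07):
    """MIL-flavoured NCE loss sketch: each video clip has a *bag* of
    candidate text matches (e.g. subtitle words, dictionary entries), any of
    which may be the true positive. Shapes: video_emb (B, D),
    text_embs (N, D), pos_mask (B, N) with 1s marking each clip's bag."""
    sims = video_emb @ text_embs.T / temperature          # (B, N)
    pos = torch.logsumexp(sims.masked_fill(pos_mask == 0,
                                           float('-inf')), dim=1)
    all_ = torch.logsumexp(sims, dim=1)
    return (all_ - pos).mean()  # -log( sum_pos / sum_all )

# Toy batch: 2 clips, 5 candidate texts, bags of size 2.
v = torch.nn.functional.normalize(torch.randn(2, 16), dim=-1)
t = torch.nn.functional.normalize(torch.randn(5, 16), dim=-1)
mask = torch.tensor([[1, 1, 0, 0, 0], [0, 0, 1, 1, 0]])
print(mil_nce(v, t, mask))
```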
Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction
Self-supervised learning (SSL) has emerged as a promising paradigm for
learning flexible speech representations from unlabeled data. By designing
pretext tasks that exploit statistical regularities, SSL models can capture
useful representations that are transferable to downstream tasks. This study
provides an empirical analysis of Barlow Twins (BT), an SSL technique inspired
by theories of redundancy reduction in human perception. On downstream tasks,
BT representations accelerated learning and transferred across domains.
However, limitations exist in disentangling key explanatory factors, with
redundancy reduction and invariance alone insufficient for factorization of
learned latents into modular, compact, and informative codes. Our ablation
study isolated gains from invariance constraints, but these gains were
context-dependent. Overall, this work substantiates the potential of Barlow
Twins for sample-efficient speech encoding. However, challenges remain in
achieving fully hierarchical representations. The analysis methodology and
insights pave a path for extensions incorporating further inductive priors and
perceptual principles to further enhance the BT self-supervision framework.
Comment: 13 pages, 5 figures, in submission to MDPI Information
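For reference, the Barlow Twins objective itself is compact enough to state in a few lines. The sketch below is the standard formulation (batch-normalised embeddings, identity-targeted cross-correlation), not this paper's training code; batch and embedding sizes are illustrative.

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective: drive the cross-correlation matrix between
    two views' embeddings toward identity. Diagonal terms enforce
    invariance; off-diagonal terms enforce redundancy reduction.
    z1, z2: (N, D) embeddings of two augmented views of the same speech."""
    N, D = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)  # normalise each dim over the batch
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / N                   # (D, D) cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + lam * off_diag

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(barlow_twins_loss(z1, z2))
```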
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes, like the activation of voice assistants. Prospects suggest sustained
growth in the social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
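As a concrete example of the posterior-handling stage such systems share, the sketch below implements the smoothing-plus-sliding-maximum scheme common in the deep KWS literature; window sizes and the threshold are illustrative, not drawn from this survey.

```python
import numpy as np

def kws_decision(posteriors, w_smooth=10, w_max=40, threshold=0.7):
    """Posterior-handling stage of a deep KWS pipeline. `posteriors`:
    (T, K) per-frame keyword posteriors from the acoustic model,
    one column per keyword."""
    T, K = posteriors.shape
    # 1) smooth posteriors over a short window to suppress frame-level noise
    smoothed = np.stack([posteriors[max(0, t - w_smooth + 1):t + 1].mean(0)
                         for t in range(T)])
    # 2) confidence = max smoothed posterior within a sliding window
    conf = np.stack([smoothed[max(0, t - w_max + 1):t + 1].max(0)
                     for t in range(T)])
    return conf > threshold  # (T, K): keyword k detected at frame t

post = np.random.rand(200, 2) * 0.3
post[100:115, 0] = 0.95  # keyword 0 fires around frame 100
hits = kws_decision(post)
print(hits[:, 0].any(), hits[:, 1].any())  # True False
```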
Keyword localisation in untranscribed speech using visually grounded speech models
Keyword localisation is the task of finding where in a speech utterance a
given query keyword occurs. We investigate to what extent keyword localisation
is possible using a visually grounded speech (VGS) model. VGS models are
trained on unlabelled images paired with spoken captions. These models are
therefore self-supervised -- trained without any explicit textual label or
location information. To obtain training targets, we first tag training images
with soft text labels using a pretrained visual classifier with a fixed
vocabulary. This enables a VGS model to predict the presence of a written
keyword in an utterance, but not its location. We consider four ways to equip
VGS models with localisation capabilities. Two of these -- a saliency approach
and input masking -- can be applied to an arbitrary prediction model after
training, while the other two -- attention and a score aggregation approach --
are incorporated directly into the structure of the model. Masking-based
localisation gives some of the best reported localisation scores from a VGS
model, with an accuracy of 57% when the system knows that a keyword occurs in
an utterance and needs to predict its location. In a setting where localisation
is performed after detection, an F1 of 25% is achieved, and in a setting
where a keyword spotting ranking pass is first performed, we get a localisation
P@10 of 32%. While these scores are modest compared to the idealised setting
with unordered bag-of-words supervision (from transcriptions), these models do
not receive any textual or location supervision. Further analyses show that
these models are limited by the first detection or ranking pass. Moreover,
individual keyword localisation performance is correlated with the tagging
performance from the visual classifier. We also show qualitatively how and
where semantic mistakes occur, e.g. that the model locates surfer when queried
with ocean.
Comment: 10 figures, 5 tables
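The input-masking approach mentioned above is easy to sketch: occlude successive windows of the utterance and blame the keyword on the window whose removal hurts the detection score most. The detector below is a toy stand-in for a trained VGS model; window and hop sizes are invented.

```python
import numpy as np

def masking_localisation(frames, detect_score, win=20, hop=10):
    """Input-masking localisation: zero out successive windows of the
    utterance and attribute the keyword to the window whose removal lowers
    the detection score most. `detect_score` is any callable mapping an
    utterance (T, D) to a keyword-presence score."""
    base = detect_score(frames)
    starts = range(0, max(1, len(frames) - win + 1), hop)
    drops = []
    for s in starts:
        masked = frames.copy()
        masked[s:s + win] = 0.0
        drops.append(base - detect_score(masked))
    best = list(starts)[int(np.argmax(drops))]
    return best, best + win  # frame interval blamed for the keyword

# Toy detector: score = energy in frames 50-70, so masking there hurts most.
score = lambda x: x[50:70].sum()
frames = np.random.rand(100, 4)
print(masking_localisation(frames, score))  # (50, 70)
```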