920 research outputs found
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading or lipreading is the technique of understanding and getting
phonetic features from a speaker's visual features such as movement of lips,
face, teeth and tongue. It has a wide range of multimedia applications such as
in surveillance, Internet telephony, and as an aid to a person with hearing
impairments. However, most of the work in speechreading has been limited to
text generation from silent videos. Recently, research has started venturing
into generating (audio) speech from silent video sequences but there have been
no developments thus far in dealing with divergent views and poses of a
speaker. Thus although, we have multiple camera feeds for the speech of a user,
but we have failed in using these multiple video feeds for dealing with the
different poses. To this end, this paper presents the world's first ever
multi-view speech reading and reconstruction system. This work encompasses the
boundaries of multimedia research by putting forth a model which leverages
silent video feeds from multiple cameras recording the same subject to generate
intelligent speech for a speaker. Initial results confirm the usefulness of
exploiting multiple camera views in building an efficient speech reading and
reconstruction system. It further shows the optimal placement of cameras which
would lead to the maximum intelligibility of speech. Next, it lays out various
innovative applications for the proposed system focusing on its potential
prodigious impact in not just security arena but in many other multimedia
analytics problems.Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul,
Republic of Kore
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which so far has received no attention by the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Different to prior works on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods.Comment: Accepted at ECCV-201
End-to-End Audiovisual Fusion with LSTMs
Several end-to-end deep learning approaches have been recently presented
which simultaneously extract visual features from the input images and perform
visual speech classification. However, research on jointly extracting audio and
visual features and performing classification is very limited. In this work, we
present an end-to-end audiovisual model based on Bidirectional Long Short-Term
Memory (BLSTM) networks. To the best of our knowledge, this is the first
audiovisual fusion model which simultaneously learns to extract features
directly from the pixels and spectrograms and perform classification of speech
and nonlinguistic vocalisations. The model consists of multiple identical
streams, one for each modality, which extract features directly from mouth
regions and spectrograms. The temporal dynamics in each stream/modality are
modeled by a BLSTM and the fusion of multiple streams/modalities takes place
via another BLSTM. An absolute improvement of 1.9% in the mean F1 of 4
nonlingusitic vocalisations over audio-only classification is reported on the
AVIC database. At the same time, the proposed end-to-end audiovisual fusion
system improves the state-of-the-art performance on the AVIC database leading
to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual
speech recognition experiments on the OuluVS2 database using different views of
the mouth, frontal to profile. The proposed audiovisual system significantly
outperforms the audio-only model for all views when the acoustic noise is high.Comment: Accepted to AVSP 2017. arXiv admin note: substantial text overlap
with arXiv:1709.00443 and text overlap with arXiv:1701.0584
Combining Multiple Views for Visual Speech Recognition
Visual speech recognition is a challenging research problem with a particular
practical application of aiding audio speech recognition in noisy scenarios.
Multiple camera setups can be beneficial for the visual speech recognition
systems in terms of improved performance and robustness. In this paper, we
explore this aspect and provide a comprehensive study on combining multiple
views for visual speech recognition. The thorough analysis covers fusion of all
possible view angle combinations both at feature level and decision level. The
employed visual speech recognition system in this study extracts features
through a PCA-based convolutional neural network, followed by an LSTM network.
Finally, these features are processed in a tandem system, being fed into a
GMM-HMM scheme. The decision fusion acts after this point by combining the
Viterbi path log-likelihoods. The results show that the complementary
information contained in recordings from different view angles improves the
results significantly. For example, the sentence correctness on the test set is
increased from 76% for the highest performing single view () to up to
83% when combining this view with the frontal and view angles
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
Silent speech interface is a promising technology that enables private
communications in natural language. However, previous approaches only support a
small and inflexible vocabulary, which leads to limited expressiveness. We
leverage contrastive learning to learn efficient lipreading representations,
enabling few-shot command customization with minimal user effort. Our model
exhibits high robustness to different lighting, posture, and gesture conditions
on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947
is achievable only using one shot, and its performance can be further boosted
by adaptively learning from more data. This generalizability allowed us to
develop a mobile silent speech interface empowered with on-device fine-tuning
and visual keyword spotting. A user study demonstrated that with LipLearner,
users could define their own commands with high reliability guaranteed by an
online incremental learning scheme. Subjective feedback indicated that our
system provides essential functionalities for customizable silent speech
interactions with high usability and learnability.Comment: Conditionally accepted to the ACM CHI Conference on Human Factors in
Computing Systems 2023 (CHI '23
- …