Audio-visual speech recognition with a hybrid CTC/attention architecture
Recent works in speech recognition rely either on connectionist temporal classification (CTC) or on sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide non-sequential alignments. Therefore, a CTC loss can be used in combination with an attention-based model in order to force monotonic alignments and, at the same time, remove the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time such a hybrid architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the audio-only model and achieves new state-of-the-art performance on the LRS2 database (7% word error rate). We also observe that the audio-visual model significantly outperforms the audio-based model (up to 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases.
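As a rough illustration of the joint objective described above, the following PyTorch sketch (not the paper's implementation) interpolates a CTC loss and an attention-decoder cross-entropy loss with a hypothetical weight ctc_weight; the CTC term encourages monotonic alignments while the attention term removes the conditional independence assumption.

import torch
import torch.nn as nn

class HybridCTCAttentionLoss(nn.Module):
    def __init__(self, blank_id=0, ctc_weight=0.3, pad_id=-100):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, attn_logits, targets, input_lengths, target_lengths):
        # ctc_log_probs: (T, B, V) log-softmax output of the CTC branch
        # attn_logits:   (B, L, V) output of the attention decoder
        # targets:       (B, L) token ids, padded with pad_id
        ctc_targets = targets.clamp(min=0)  # padded positions are ignored via target_lengths
        loss_ctc = self.ctc(ctc_log_probs, ctc_targets, input_lengths, target_lengths)
        loss_att = self.ce(attn_logits.transpose(1, 2), targets)
        return self.ctc_weight * loss_ctc + (1.0 - self.ctc_weight) * loss_att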
Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition
Several audio-visual speech recognition models have been recently proposed
which aim to improve the robustness over audio-only models in the presence of
noise. However, almost all of them ignore the impact of the Lombard effect,
i.e., the change in speaking style in noisy environments which aims to make
speech more intelligible and affects both the acoustic characteristics of
speech and the lip movements. In this paper, we investigate the impact of the
Lombard effect in audio-visual speech recognition. To the best of our
knowledge, this is the first work which does so using end-to-end deep
architectures and presents results on unseen speakers. Our results show that
properly modelling Lombard speech is always beneficial. Even when a relatively
small amount of Lombard speech is added to the training set, the
performance in a real scenario, where noisy Lombard speech is present, can be
significantly improved. We also show that the standard approach followed in the
literature, where a model is trained and tested on noisy plain speech, provides
a correct estimate of the video-only performance and slightly underestimates
the audio-visual performance. In the case of audio-only approaches, performance is
overestimated for SNRs higher than -3 dB and underestimated for lower SNRs.
Comment: Accepted for publication at Interspeech 201
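A small sketch of the noise-mixing step implied by the SNR-based evaluation above, assuming NumPy and additive noise scaled to a target SNR (the specific noise types and SNR range used in the paper are not reproduced here):

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Loop or trim the noise to the length of the speech signal.
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that p_speech / (scale**2 * p_noise) equals 10**(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise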
ASR is all you need: cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recognition
without requiring human annotated ground truth data. We achieve this by
distilling from an Automatic Speech Recognition (ASR) model that has been
trained on a large-scale audio-only corpus. We use a cross-modal distillation
method that combines Connectionist Temporal Classification (CTC) with a
frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that
ground truth transcriptions are not necessary to train a lip reading system;
(ii) we show how arbitrary amounts of unlabelled video data can be leveraged to
improve performance; (iii) we demonstrate that distillation significantly
speeds up training; and, (iv) we obtain state-of-the-art results on the
challenging LRS2 and LRS3 datasets when trained only on publicly available
data.
Comment: ICASSP 202
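A minimal sketch of how the two distillation terms can be combined, assuming PyTorch, a frozen ASR teacher producing per-frame posteriors, and pseudo-transcripts decoded from that teacher; the weighting alpha is hypothetical, not the paper's value.

import torch
import torch.nn as nn

class CrossModalDistillationLoss(nn.Module):
    def __init__(self, blank_id=0, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

    def forward(self, student_log_probs, teacher_posteriors,
                pseudo_targets, input_lengths, target_lengths):
        # student_log_probs: (T, B, V) log-softmax output of the lip-reading student
        # teacher_posteriors: (T, B, V) per-frame posteriors from the frozen ASR teacher
        # pseudo_targets:     (B, S) token sequences decoded by the teacher (no human labels)
        loss_ctc = self.ctc(student_log_probs, pseudo_targets, input_lengths, target_lengths)
        # Frame-wise cross-entropy against the teacher distribution.
        loss_ce = -(teacher_posteriors * student_log_probs).sum(dim=-1).mean()
        return self.alpha * loss_ctc + (1.0 - self.alpha) * loss_ce

Because no ground-truth transcripts enter either term, arbitrary amounts of unlabelled video can be used, as noted in point (ii) above.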
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
The large amount of audiovisual content being shared online today has drawn
substantial attention to the prospect of audiovisual self-supervised learning.
Recent works have focused on each of these modalities separately, while others
have attempted to model both simultaneously in a cross-modal fashion. However,
comparatively little attention has been given to leveraging one modality as a
training objective to learn from the other. In this work, we propose Learning
visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features
from unlabelled visual speech. We find that this pre-trained model can be
leveraged towards word-level and sentence-level lip-reading through feature
extraction and fine-tuning experiments. We show that our approach significantly
outperforms other self-supervised methods on the Lip Reading in the Wild (LRW)
dataset and achieves state-of-the-art performance on Lip Reading Sentences 2
(LRS2) using only a fraction of the total labelled data.
Comment: Accepted for publication at Interspeech 202
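A rough sketch of the pre-training objective described above, assuming PyTorch, a generic video encoder standing in for the ResNet+Conformer, and filterbank frames as the regression target (the acoustic features and loss actually used in the paper may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToAudioRegressor(nn.Module):
    def __init__(self, video_encoder, enc_dim, acoustic_dim=80):
        super().__init__()
        self.encoder = video_encoder      # e.g. a ResNet+Conformer front-end
        self.head = nn.Linear(enc_dim, acoustic_dim)

    def forward(self, video_frames):
        # video_frames: (B, T, C, H, W); the encoder is assumed to return (B, T, enc_dim)
        return self.head(self.encoder(video_frames))

def pretraining_loss(predicted, acoustic_targets):
    # Regress acoustic features from unlabelled lip video; no transcripts are needed.
    return F.l1_loss(predicted, acoustic_targets)

After pre-training, the encoder can be reused for word-level and sentence-level lip reading via feature extraction or fine-tuning, as described in the abstract.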