74,696 research outputs found
Lip Reading Sentences in the Wild
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences,
and videos captured in the wild.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available.
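Contribution (2), curriculum learning, trains first on short utterances and gradually admits longer ones. A minimal sketch, assuming a length-based schedule (the word-count thresholds and data layout below are illustrative, not the authors' exact curriculum):

```python
# Sketch of a length-based curriculum: begin training on short
# transcriptions and progressively admit longer ones. The stage
# thresholds are illustrative assumptions, not the WLAS schedule.

def curriculum_stages(samples, thresholds=(1, 2, 4, 8)):
    """Yield one cumulative training pool per curriculum stage.

    `samples` is a list of (clip_id, transcript) pairs; a sample enters
    the pool once the stage's word-length threshold admits it.
    """
    stages = []
    for max_words in thresholds:
        pool = [s for s in samples if len(s[1].split()) <= max_words]
        stages.append(pool)
    return stages

samples = [
    ("clip_a", "hello"),
    ("clip_b", "good evening"),
    ("clip_c", "the weather will improve tomorrow"),
]
stages = curriculum_stages(samples)
# The first stage holds only single-word clips; the last covers all samples.
```

Early stages converge quickly on easy examples, which is what accelerates training and curbs overfitting in the strategy described above.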
End-to-end Lip-reading: A Preliminary Study
Deep lip-reading combines computer vision and natural language processing: it uses deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-stage training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, an ideal in machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems; some exceptions are the very recent Temporal Convolutional Network (TCN) based architectures. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined, and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions have been made: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) creation and public release of a pipeline to adapt the sentence-level Lip Reading Sentences 3 (LRS3) dataset into word level; iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept; iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study verifies that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.
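Contribution (ii), converting a sentence-level dataset into word-level samples, can be sketched as below. The alignment format and field names are assumptions for illustration only; LRS3 ships sentence-level transcripts, so per-word timings would in practice come from a forced aligner.

```python
# Hedged sketch of splitting one sentence-level sample into word-level
# samples, assuming per-word time alignments are available. The tuple
# layout (word, start_sec, end_sec) and dict keys are illustrative.

def sentence_to_words(video_id, alignment):
    """alignment: list of (word, start_sec, end_sec) tuples."""
    clips = []
    for i, (word, start, end) in enumerate(alignment):
        clips.append({
            "source": video_id,       # original sentence-level clip
            "label": word.upper(),    # word-level class label
            "segment": (start, end),  # temporal crop boundaries
            "index": i,               # position within the sentence
        })
    return clips

clips = sentence_to_words("spk001/00001",
                          [("nice", 0.0, 0.4), ("work", 0.4, 0.9)])
```

Each emitted record is enough to crop a word clip from the source video and file it under its class, mirroring the word-level layout of datasets such as LRW.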
Word-level Persian Lipreading Dataset
Lip-reading has made impressive progress in recent years, driven by advances
in deep learning. Nonetheless, the prerequisite for such advances is a suitable
dataset. This paper provides a new in-the-wild dataset for Persian word-level
lip-reading containing 244,000 videos from approximately 1,800 speakers. We
evaluated the state-of-the-art method in this field and used a novel approach
for word-level lip-reading: using the AV-HuBERT model for feature extraction,
we obtained significantly better performance on our dataset.
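The described recipe, frozen-feature extraction followed by word-level classification, can be illustrated with a deliberately simple sketch. The pooling step and nearest-class-mean classifier stand in for the real AV-HuBERT pipeline, whose actual API is not reproduced here; the features and word labels are synthetic placeholders.

```python
import numpy as np

# Illustrative sketch only: a pretrained extractor (e.g. AV-HuBERT)
# would map a video to a (frames, dim) feature array; here we use
# synthetic features and a nearest-class-mean classifier as stand-ins.

def pool(features):
    """Average a (frames, dim) feature sequence over time."""
    return features.mean(axis=0)

def fit_centroids(pooled_by_word):
    """Compute one centroid per word class from pooled features."""
    return {w: np.mean(v, axis=0) for w, v in pooled_by_word.items()}

def predict(centroids, pooled):
    """Classify a pooled feature vector by its nearest centroid."""
    return min(centroids, key=lambda w: np.linalg.norm(centroids[w] - pooled))

rng = np.random.default_rng(0)
train = {
    "salam": [pool(rng.normal(0.0, 0.1, (10, 4))) for _ in range(3)],
    "khoob": [pool(rng.normal(3.0, 0.1, (10, 4))) for _ in range(3)],
}
centroids = fit_centroids(train)
```

A stronger classifier (a linear probe or fine-tuned head) would replace the centroid rule in practice; the point is only that frozen features plus a light classifier already separate word classes.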
Lip-reading with Densely Connected Temporal Convolutional Networks
In this work, we present the Densely Connected Temporal Convolutional Network
(DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional
Networks (TCNs) have recently demonstrated great potential in many vision tasks,
their receptive fields are not dense enough to model the complex temporal
dynamics in lip-reading scenarios. To address this problem, we introduce dense
connections into the network to capture more robust temporal features.
Moreover, our approach utilises the Squeeze-and-Excitation block, a
light-weight attention mechanism, to further enhance the model's classification
power. Without bells and whistles, our DC-TCN method has achieved 88.36%
accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the
LRW-1000 dataset, which has surpassed all the baseline methods and is the new
state-of-the-art on both datasets.
Comment: WACV 202
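The Squeeze-and-Excitation block mentioned above can be sketched in a few lines of numpy. The weight shapes and reduction ratio are standard SE choices, not taken from the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation over a (channels, time) feature map.

    w1: (channels // r, channels) and w2: (channels, channels // r) are
    the two FC layers of the excitation path (r = reduction ratio).
    """
    z = x.mean(axis=1)                         # squeeze: global average over time
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # excitation: FC -> ReLU -> FC -> sigmoid
    return x * s[:, None]                      # reweight each channel

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))    # 4 channels, 6 time steps
w1 = rng.normal(size=(2, 4))   # reduction ratio r = 2
w2 = rng.normal(size=(4, 2))
out = se_block(x, w1, w2)
```

Because the gate `s` lies in (0, 1) per channel, the block can only attenuate channels, which is what makes it a lightweight attention mechanism rather than a full mixing layer.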
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences,
and videos captured in the wild. Our key contributions are: (1) we compare two models
for lip reading, one using a CTC loss, and the other using a
sequence-to-sequence loss. Both models are built on top of the transformer
self-attention architecture; (2) we investigate to what extent lip reading is
complementary to audio speech recognition, especially when the audio signal is
noisy; (3) we introduce and publicly release a new dataset for audio-visual
speech recognition, LRS2-BBC, consisting of thousands of natural sentences from
British television. The models that we train surpass the performance of all
previous work on a lip reading benchmark dataset by a significant margin.
Comment: Accepted for publication by IEEE Transactions on Pattern Analysis and
Machine Intelligence
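For contribution (1), a CTC-trained model decodes by collapsing repeated symbols and dropping blanks. A minimal sketch of that collapse rule, assuming greedy decoding with '-' as the blank symbol:

```python
# CTC output collapse: merge consecutive repeats, then drop blanks.
# '-' denotes the blank symbol; greedy (best-path) decoding is assumed.

def ctc_collapse(path, blank="-"):
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# e.g. ctc_collapse("hh-e-ll-llo") -> "hello"
```

This many-to-one mapping is why CTC needs no frame-level alignment, whereas the sequence-to-sequence loss instead learns the alignment implicitly through attention.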
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
The large amount of audiovisual content being shared online today has drawn
substantial attention to the prospect of audiovisual self-supervised learning.
Recent works have focused on each of these modalities separately, while others
have attempted to model both simultaneously in a cross-modal fashion. However,
comparatively little attention has been given to leveraging one modality as a
training objective to learn from the other. In this work, we propose Learning
visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features
from unlabelled visual speech. We find that this pre-trained model can be
leveraged towards word-level and sentence-level lip-reading through feature
extraction and fine-tuning experiments. We show that our approach significantly
outperforms other self-supervised methods on the Lip Reading in the Wild (LRW)
dataset and achieves state-of-the-art performance on Lip Reading Sentences 2
(LRS2) using only a fraction of the total labelled data.
Comment: Accepted for publication at Interspeech 202
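The core LiRA idea, using one modality (acoustic features) as the regression target for the other (visual features), can be reduced to a toy example. The closed-form linear predictor below is a stand-in for the ResNet+Conformer model, and the data is purely synthetic:

```python
import numpy as np

# Toy sketch of cross-modal self-supervision: fit a predictor from
# visual features to acoustic features without any text labels. A
# linear least-squares map stands in for the ResNet+Conformer model.

rng = np.random.default_rng(0)
true_map = rng.normal(size=(8, 12))   # hidden visual->acoustic relation
visual = rng.normal(size=(200, 8))    # unlabelled visual speech features
acoustic = visual @ true_map          # acoustic targets (noise-free toy)

# "Pre-training": solve min_W ||visual @ W - acoustic||^2 in closed form.
W, *_ = np.linalg.lstsq(visual, acoustic, rcond=None)

# Residual of the fitted predictor; near zero in this noise-free toy.
err = np.linalg.norm(visual @ W - acoustic)
```

In LiRA the pre-trained encoder is then reused for lip-reading via feature extraction or fine-tuning; here, `W` plays the role of that transferable representation.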
Reduced expression of C/EBPβ-LIP extends health- and lifespan in mice
Ageing is associated with physical decline and the development of age-related diseases such as metabolic disorders and cancer. Few conditions are known that attenuate the adverse effects of ageing, including calorie restriction (CR) and reduced signalling through the mechanistic target of rapamycin complex 1 (mTORC1) pathway. Synthesis of the metabolic transcription factor C/EBPβ-LIP is stimulated by mTORC1, which critically depends on a short upstream open reading frame (uORF) in the Cebpb-mRNA. Here we show that reduced C/EBPβ-LIP expression due to genetic ablation of the uORF delays the development of age-associated phenotypes in mice. Moreover, female C/EBPβΔuORF mice display an extended lifespan. Since LIP levels increase upon ageing in wild-type mice, our data reveal an important role for C/EBPβ in the ageing process and suggest that restriction of LIP expression sustains health and fitness. Thus, therapeutic strategies targeting C/EBPβ-LIP may offer new possibilities to treat age-related diseases and to prolong healthspan.