38 research outputs found

Lip-reading with Densely Connected Temporal Convolutional Networks

Authors: Ma Pingchuan; Pantic Maja; Petridis Stavros; Shen Jie; Wang Yujiang
Publication date: 11/11/2020

In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a light-weight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all the baseline methods and setting a new state of the art on both datasets.
Comment: WACV 202

arXiv.org e-Print Archive

LiRA: learning visual speech representations from audio through self-supervision

Authors: Ma Pingchuan; Mira Rodrigo; Pantic Maja; Petridis Stavros; Schuller Björn W.
Publication venue: International Speech Communication Association
Publication date: 27/12/2021

OPUS Augsburg

LiRA: Learning Visual Speech Representations from Audio through Self-supervision

Authors: Ma Pingchuan; Mira Rodrigo; Pantic Maja; Petridis Stavros; Schuller Björn W.
Publication date: 16/06/2021

The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
Comment: Accepted for publication at Interspeech 202

arXiv.org e-Print Archive

OPUS Augsburg