
    Visual units and confusion modelling for automatic lip-reading

    Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy.
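    The abstract describes decoding realised as a cascade of weighted finite-state transducers; the sketch below strips that machinery down to a plain-Python dynamic programme that scores a phoneme hypothesis against an observed viseme sequence under a substitution/deletion confusion model. The visual-unit names, the confusion probabilities and the scoring function are all illustrative assumptions, not the paper's actual model.

```python
import math

# Illustrative confusion model (probabilities are made up): each visual unit
# can stand in for several phonemes because cues such as voicing are not
# visible on the lips.
SUBSTITUTION = {
    "p_b_m": {"p": 0.4, "b": 0.35, "m": 0.25},   # bilabials look alike
    "f_v":   {"f": 0.55, "v": 0.45},             # labiodentals look alike
    "ah":    {"ah": 0.8, "aa": 0.2},
}
# Probability that a phoneme leaves no visible trace at all (a deletion).
DELETION = {"p": 0.05, "b": 0.05, "m": 0.05, "f": 0.02, "v": 0.02,
            "ah": 0.01, "aa": 0.01}


def score(visual_units, phonemes):
    """Log-probability that `phonemes` produced the observed `visual_units`,
    allowing systematic substitutions and deletions (no insertions here)."""
    NEG_INF = float("-inf")
    n, m = len(visual_units), len(phonemes)
    # dp[i][j]: best log-prob of aligning the first i visual units
    # with the first j phonemes.
    dp = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == NEG_INF:
                continue
            if i < n and j < m:  # phoneme j realised as visual unit i
                p = SUBSTITUTION.get(visual_units[i], {}).get(phonemes[j], 0.0)
                if p > 0:
                    dp[i + 1][j + 1] = max(dp[i + 1][j + 1], dp[i][j] + math.log(p))
            if j < m:            # phoneme j deleted from the visual stream
                p = DELETION.get(phonemes[j], 0.0)
                if p > 0:
                    dp[i][j + 1] = max(dp[i][j + 1], dp[i][j] + math.log(p))
    return dp[n][m]


# "bat"-like and "mat"-like openings produce the same visual units, but the
# model assigns them different finite scores instead of treating them as identical.
print(score(["p_b_m", "ah"], ["b", "ah"]))
print(score(["p_b_m", "ah"], ["m", "ah"]))
```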

    Decoding visemes: improving machine lip-reading

    To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models, with varying degrees of success. A few recent works suggest that phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance that significantly improves on previous lip-reading results.
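    The abstract does not spell out the two-pass algorithm, so the following sketch shows one plausible reading under stated assumptions: a first-pass viseme classifier routes each sample to a second-pass phoneme classifier that only has to separate the phonemes sharing that viseme. The synthetic features, the phoneme-to-viseme map and the use of scikit-learn logistic regression are illustrative choices, not the authors' method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 2-D "visual features" for four phonemes that collapse into two
# visemes (bilabials vs. labiodentals). The labels and the phoneme->viseme
# map are invented for illustration.
PHONEME_TO_VISEME = {"p": "bilabial", "b": "bilabial",
                     "f": "labiodental", "v": "labiodental"}
phonemes = np.array(["p", "b", "f", "v"] * 100)
centres = {"p": (0.0, 0.0), "b": (0.5, 0.0), "f": (4.0, 4.0), "v": (4.5, 4.0)}
X = np.array([centres[ph] for ph in phonemes]) + rng.normal(scale=0.4, size=(len(phonemes), 2))
visemes = np.array([PHONEME_TO_VISEME[ph] for ph in phonemes])

# Pass 1: train a viseme classifier on the coarse labels.
viseme_clf = LogisticRegression(max_iter=1000).fit(X, visemes)

# Pass 2: within each predicted viseme class, train a phoneme classifier that
# only has to separate the phonemes sharing that viseme.
predicted_visemes = viseme_clf.predict(X)
phoneme_clfs = {}
for v in np.unique(predicted_visemes):
    mask = predicted_visemes == v
    phoneme_clfs[v] = LogisticRegression(max_iter=1000).fit(X[mask], phonemes[mask])


def classify(x):
    """Predict the viseme first, then refine with that viseme's phoneme classifier."""
    v = viseme_clf.predict(x.reshape(1, -1))[0]
    return phoneme_clfs[v].predict(x.reshape(1, -1))[0]


print(classify(np.array([0.4, 0.1])))   # likely "b": nearest to that cluster centre
```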

    A Mouth Full of Words: Visually Consistent Acoustic Redubbing

    This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional one-to-many static visemes lack flexibility for this application, as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech, offering insights for automatic speech recognition and highlighting the importance of language modelling.
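    A minimal sketch of the core idea, assuming an invented dynamic-viseme-to-phoneme candidate table and a toy pronunciation dictionary: expanding the per-viseme candidates and intersecting them with the dictionary already yields several different words that fit the same lip movements, which is the ambiguity the paper exploits (a language model would then rank the alternatives).

```python
from itertools import product

# Illustrative many-to-many mapping: each dynamic viseme in the video is
# compatible with several short phoneme sequences (entries are made up).
DYNAMIC_VISEME_CANDIDATES = [
    [("b",), ("p",), ("m",)],          # first lip gesture
    [("ae", "t"), ("ae", "d")],        # second lip gesture
]

# Toy pronunciation dictionary, inverted to map phoneme strings to words.
PRON_DICT = {"bat": ("b", "ae", "t"), "bad": ("b", "ae", "d"),
             "pat": ("p", "ae", "t"), "pad": ("p", "ae", "d"),
             "mat": ("m", "ae", "t"), "mad": ("m", "ae", "d")}
PHONES_TO_WORD = {phones: word for word, phones in PRON_DICT.items()}

# Expand every combination of per-viseme candidates into a phoneme sequence,
# then keep the ones that spell out dictionary words.
in_sync_words = []
for combo in product(*DYNAMIC_VISEME_CANDIDATES):
    phones = tuple(p for segment in combo for p in segment)
    word = PHONES_TO_WORD.get(phones)
    if word is not None:
        in_sync_words.append(word)

print(in_sync_words)   # several different words fit the same lip movements
```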

    Text-based Editing of Talking-head Video

    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript, producing a realistic output video in which the dialogue of the speaker has been modified while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
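    The segment-selection step is the most algorithmic part of the pipeline described above; the toy sketch below replaces the paper's optimization strategy with a greedy longest-match over an invented phoneme-annotated corpus, purely to illustrate how an edited phoneme sequence can be covered by reusing frame spans from the original recording.

```python
# Toy phoneme annotation of the original recording: (phoneme, (start_frame, end_frame)).
# The phonemes and frame spans are invented for illustration.
CORPUS = [
    ("h", (0, 3)), ("eh", (3, 7)), ("l", (7, 9)), ("ow", (9, 14)),
    ("w", (14, 17)), ("er", (17, 21)), ("l", (21, 23)), ("d", (23, 26)),
]
corpus_phones = [p for p, _ in CORPUS]


def select_segments(target_phones):
    """Greedily cover `target_phones` with the longest contiguous corpus runs."""
    segments, i = [], 0
    while i < len(target_phones):
        best_len, best_start = 0, None
        for start in range(len(corpus_phones)):
            length = 0
            while (i + length < len(target_phones)
                   and start + length < len(corpus_phones)
                   and corpus_phones[start + length] == target_phones[i + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        if best_len == 0:
            raise ValueError(f"no corpus segment matches phoneme {target_phones[i]!r}")
        segments.append(CORPUS[best_start:best_start + best_len])
        i += best_len
    return segments


# An edited phoneme sequence reuses the "h eh l ow" run plus the frame span of "d".
for segment in select_segments(["h", "eh", "l", "ow", "d"]):
    print(segment)
```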