Lip Reading Sentences in the Wild
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences,
and in-the-wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available.
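The 'Attend' step of the WLAS model can be illustrated with a toy dot-product attention over two encoder streams, one per modality. This is a rough sketch only: the dimensions, weights, and fusion-by-concatenation here are invented for illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, states):
    """Dot-product attention: weight encoder states by similarity to the
    decoder query, then return their weighted sum (the context vector)."""
    scores = states @ query                 # (T,)
    weights = softmax(scores)               # attention distribution over time
    return weights @ states, weights        # context vector, weights

rng = np.random.default_rng(0)
d = 8
video_states = rng.normal(size=(12, d))    # "watch" encoder outputs (lip frames)
audio_states = rng.normal(size=(20, d))    # "listen" encoder outputs (audio features)
query = rng.normal(size=d)                 # decoder state while "spelling" a character

ctx_v, w_v = attend(query, video_states)
ctx_a, w_a = attend(query, audio_states)
fused = np.concatenate([ctx_v, ctx_a])     # one context vector per modality
print(fused.shape)
```

At each output character the decoder attends independently over the video and audio streams, which is what lets the model degrade gracefully when one modality is missing.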
The Conversation: Deep Audio-Visual Speech Enhancement
Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and for
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
Comment: To appear in Interspeech 2018. We provide supplementary material with
interactive demonstrations on
http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
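Predicting both the magnitude and the phase of the target signal can be sketched as applying a magnitude mask and a phase correction to the mixture's complex spectrogram. The mask parameterisation below (sigmoid magnitude mask plus an additive phase residual) is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 5, 4
# Mixture STFT: complex-valued time-frequency bins (toy values).
mixture = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))

# Stand-ins for network outputs: a magnitude mask in (0, 1) and a small
# phase residual per bin.
mag_mask = 1.0 / (1.0 + np.exp(-rng.normal(size=(F, T))))  # sigmoid
phase_residual = 0.1 * rng.normal(size=(F, T))

# Enhanced target: scale the mixture magnitude, rotate its phase.
target = mag_mask * np.abs(mixture) * np.exp(
    1j * (np.angle(mixture) + phase_residual)
)
print(target.shape)
```

Estimating phase as well as magnitude matters because reusing the noisy mixture phase, as magnitude-only methods do, caps the achievable separation quality.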
You said that?
We present a method for generating a video of a talking face. The method
takes as inputs: (i) still images of the target face, and (ii) an audio speech
segment; and outputs a video of the target face lip synched with the audio. The
method runs in real time and is applicable to faces and audio not seen at
training time.
To achieve this we propose an encoder-decoder CNN model that uses a joint
embedding of the face and audio to generate synthesised talking face video
frames. The model is trained on tens of hours of unlabelled videos.
We also show results of re-dubbing videos using speech from a different
person.
Comment: https://youtu.be/LeufDSb15Kc British Machine Vision Conference
(BMVC), 201
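The joint-embedding idea of the encoder-decoder model above can be sketched as: encode the face and the audio separately, concatenate the embeddings, and decode a frame. Everything below (linear "encoders", dimensions, weights) is a toy stand-in for the paper's CNNs.

```python
import numpy as np

rng = np.random.default_rng(2)

def encoder(x, W):
    """Toy linear encoder standing in for a CNN encoder."""
    return np.tanh(W @ x)

d_face, d_audio, d_emb = 64, 32, 16
face = rng.normal(size=d_face)           # flattened still image of the target face
audio = rng.normal(size=d_audio)         # audio feature window for one frame

W_face = 0.1 * rng.normal(size=(d_emb, d_face))
W_audio = 0.1 * rng.normal(size=(d_emb, d_audio))
W_dec = 0.1 * rng.normal(size=(d_face, 2 * d_emb))

joint = np.concatenate([encoder(face, W_face), encoder(audio, W_audio)])
frame = W_dec @ joint                    # decoded, lip-synced output frame
print(joint.shape, frame.shape)
```

Because identity comes from the face embedding and lip motion from the audio embedding, swapping in a different audio track re-dubs the same face, which is the effect the abstract describes.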
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be
downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 .
1806.05622v2: minor fixes; 5 pages
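Speaker recognition with learned embeddings typically reduces to comparing fixed-length utterance vectors, e.g. by cosine similarity: same-speaker pairs should score higher than different-speaker pairs. The embeddings below are simulated (a speaker vector plus noise); they are not produced by the paper's CNN.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
d = 8
speaker_a = rng.normal(size=d)
# Two utterances by speaker A (embedding plus small noise) and one impostor.
utt_a1 = speaker_a + 0.05 * rng.normal(size=d)
utt_a2 = speaker_a + 0.05 * rng.normal(size=d)
impostor = rng.normal(size=d)

same = cosine(utt_a1, utt_a2)
diff = cosine(utt_a1, impostor)
print(same > diff)
```

A verification system thresholds this score to accept or reject an identity claim; training on a large, noisy corpus such as VoxCeleb2 is what makes the embeddings robust under unconstrained conditions.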
Improving Access to Health Through Collaboration: Lessons Learned from The Colorado Trust's Partnerships for Health Initiative Evaluation
This report presents findings from the evaluation of four Partnerships in Health Initiative grantees that were addressing access to health in their communities through the formation of collaboratives. Outcomes achieved by the grantees, as well as lessons learned for others embarking on collaborative processes, are described.
Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart--without annotation--the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads `in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance.
Comment: ICASSP 2020. The first three authors contributed equally to this work.
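The two-stream architecture above, a shared low-level trunk feeding separate content and identity heads, can be sketched as follows. The layer shapes and the use of plain linear maps are assumptions for illustration; the paper's networks are convolutional.

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(x):
    return np.maximum(x, 0.0)

d_in, d_shared, d_out = 40, 16, 8
W_shared = 0.1 * rng.normal(size=(d_shared, d_in))    # shared low-level trunk
W_content = 0.1 * rng.normal(size=(d_out, d_shared))  # linguistic-content head
W_identity = 0.1 * rng.normal(size=(d_out, d_shared)) # speaker-identity head

audio = rng.normal(size=d_in)
h = relu(W_shared @ audio)       # features common to both representations
content = W_content @ h          # should track what is being said
identity = W_identity @ h        # should stay constant across a speaker's clips
print(content.shape, identity.shape)
```

The self-supervised objective then pushes the content stream to align with the synchronous face motion while the identity stream stays stable over time, which is what disentangles the two factors without labels.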
Naturally Occurring Affect Predicts Verbal and Spatial Working Memory Performance
Some research has shown that induced affective states that vary in valence have
differential effects on verbal and spatial working memory performance, such that positive affect
improves verbal working memory and impairs spatial working memory, while negative affect
improves spatial working memory and impairs verbal working memory. However, other research
using similar mood induction and working memory tasks has supported a nonspecific influence
of affect on working memory performance where fear impairs, and positive affect improves, both
verbal and spatial working memory. The present study investigated whether individual
differences in naturally occurring trait and state affect could predict verbal and spatial working
memory performance across six working memory tasks. Valence uniquely predicted working
memory performance over and above arousal and the interaction of valence and arousal which
were not significant predictors. Positive affect was associated with better working memory performance,
while negative affect was associated with worse working memory performance. This pattern held
across both verbal and spatial working memory tasks, but was observed more strongly with
2-back working memory tasks than with complex span working memory tasks. These findings
suggest that, in contrast to research demonstrating differential effects of affective states on verbal
and spatial working memory performance, naturally occurring affect demonstrates a
modality-independent effect on working memory performance.
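The analysis described above, valence predicting performance over and above arousal and their interaction, amounts to a multiple regression with an interaction term. The sketch below fits such a model by least squares on simulated data; the data and effect sizes are invented to mirror the reported pattern, not the study's actual results.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
valence = rng.normal(size=n)
arousal = rng.normal(size=n)
# Simulated scores in which only valence matters.
wm_score = 0.5 * valence + rng.normal(scale=0.3, size=n)

# Design matrix: intercept, valence, arousal, valence x arousal interaction.
X = np.column_stack([np.ones(n), valence, arousal, valence * arousal])
beta, *_ = np.linalg.lstsq(X, wm_score, rcond=None)
print(np.round(beta, 2))
```

With this design, a reliably nonzero valence coefficient alongside near-zero arousal and interaction coefficients is the statistical signature of the "valence uniquely predicts" finding.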