Lip Reading Sentences in the Wild
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences,
and in-the-wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available.
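The 'Attend' step of the WLAS model can be illustrated with a toy dot-product attention over two encoder streams, one per modality. This is a rough sketch only: the dimensions, weights, and fusion-by-concatenation here are invented for illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, states):
    """Dot-product attention: weight encoder states by similarity to the
    decoder query, then return their weighted sum (the context vector)."""
    scores = states @ query                 # (T,)
    weights = softmax(scores)               # attention distribution over time
    return weights @ states, weights        # context vector, weights

rng = np.random.default_rng(0)
d = 8
video_states = rng.normal(size=(12, d))    # "watch" encoder outputs (lip frames)
audio_states = rng.normal(size=(20, d))    # "listen" encoder outputs (audio features)
query = rng.normal(size=d)                 # decoder state while "spelling" a character

ctx_v, w_v = attend(query, video_states)
ctx_a, w_a = attend(query, audio_states)
fused = np.concatenate([ctx_v, ctx_a])     # one context vector per modality
print(fused.shape)
```

At each output character the decoder attends independently over the video and audio streams, which is what lets the model degrade gracefully when one modality is missing.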
The Conversation: Deep Audio-Visual Speech Enhancement
Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and for
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
Comment: To appear in Interspeech 2018. We provide supplementary material with
interactive demonstrations on
http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
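Predicting both the magnitude and the phase of the target signal can be sketched as applying a magnitude mask and a phase correction to the mixture's complex spectrogram. The mask parameterisation below (sigmoid magnitude mask plus an additive phase residual) is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 5, 4
# Mixture STFT: complex-valued time-frequency bins (toy values).
mixture = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))

# Stand-ins for network outputs: a magnitude mask in (0, 1) and a small
# phase residual per bin.
mag_mask = 1.0 / (1.0 + np.exp(-rng.normal(size=(F, T))))  # sigmoid
phase_residual = 0.1 * rng.normal(size=(F, T))

# Enhanced target: scale the mixture magnitude, rotate its phase.
target = mag_mask * np.abs(mixture) * np.exp(
    1j * (np.angle(mixture) + phase_residual)
)
print(target.shape)
```

Estimating phase as well as magnitude matters because reusing the noisy mixture phase, as magnitude-only methods do, caps the achievable separation quality.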
You said that?
We present a method for generating a video of a talking face. The method
takes as inputs: (i) still images of the target face, and (ii) an audio speech
segment; and outputs a video of the target face lip synched with the audio. The
method runs in real time and is applicable to faces and audio not seen at
training time.
To achieve this we propose an encoder-decoder CNN model that uses a joint
embedding of the face and audio to generate synthesised talking face video
frames. The model is trained on tens of hours of unlabelled videos.
We also show results of re-dubbing videos using speech from a different
person.
Comment: https://youtu.be/LeufDSb15Kc British Machine Vision Conference
(BMVC), 201
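The joint-embedding idea of the encoder-decoder model above can be sketched as: encode the face and the audio separately, concatenate the embeddings, and decode a frame. Everything below (linear "encoders", dimensions, weights) is a toy stand-in for the paper's CNNs.

```python
import numpy as np

rng = np.random.default_rng(2)

def encoder(x, W):
    """Toy linear encoder standing in for a CNN encoder."""
    return np.tanh(W @ x)

d_face, d_audio, d_emb = 64, 32, 16
face = rng.normal(size=d_face)           # flattened still image of the target face
audio = rng.normal(size=d_audio)         # audio feature window for one frame

W_face = 0.1 * rng.normal(size=(d_emb, d_face))
W_audio = 0.1 * rng.normal(size=(d_emb, d_audio))
W_dec = 0.1 * rng.normal(size=(d_face, 2 * d_emb))

joint = np.concatenate([encoder(face, W_face), encoder(audio, W_audio)])
frame = W_dec @ joint                    # decoded, lip-synced output frame
print(joint.shape, frame.shape)
```

Because identity comes from the face embedding and lip motion from the audio embedding, swapping in a different audio track re-dubs the same face, which is the effect the abstract describes.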
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be
downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 .
1806.05622v2: minor fixes; 5 pages
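Speaker recognition with learned embeddings typically reduces to comparing fixed-length utterance vectors, e.g. by cosine similarity: same-speaker pairs should score higher than different-speaker pairs. The embeddings below are simulated (a speaker vector plus noise); they are not produced by the paper's CNN.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
d = 8
speaker_a = rng.normal(size=d)
# Two utterances by speaker A (embedding plus small noise) and one impostor.
utt_a1 = speaker_a + 0.05 * rng.normal(size=d)
utt_a2 = speaker_a + 0.05 * rng.normal(size=d)
impostor = rng.normal(size=d)

same = cosine(utt_a1, utt_a2)
diff = cosine(utt_a1, impostor)
print(same > diff)
```

A verification system thresholds this score to accept or reject an identity claim; training on a large, noisy corpus such as VoxCeleb2 is what makes the embeddings robust under unconstrained conditions.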
Improving Access to Health Through Collaboration: Lessons Learned from The Colorado Trust's Partnerships for Health Initiative Evaluation
This report presents findings from the evaluation of four Partnerships in Health Initiative grantees that were addressing access to health in their communities through the formation of collaboratives. Outcomes achieved by the grantees, as well as lessons learned for others embarking on collaborative processes, are described.
Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart--without annotation--the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads `in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance.
Comment: ICASSP 2020. The first three authors contributed equally to this work.
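The two-stream architecture above, a shared low-level trunk feeding separate content and identity heads, can be sketched as follows. The layer shapes and the use of plain linear maps are assumptions for illustration; the paper's networks are convolutional.

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(x):
    return np.maximum(x, 0.0)

d_in, d_shared, d_out = 40, 16, 8
W_shared = 0.1 * rng.normal(size=(d_shared, d_in))    # shared low-level trunk
W_content = 0.1 * rng.normal(size=(d_out, d_shared))  # linguistic-content head
W_identity = 0.1 * rng.normal(size=(d_out, d_shared)) # speaker-identity head

audio = rng.normal(size=d_in)
h = relu(W_shared @ audio)       # features common to both representations
content = W_content @ h          # should track what is being said
identity = W_identity @ h        # should stay constant across a speaker's clips
print(content.shape, identity.shape)
```

The self-supervised objective then pushes the content stream to align with the synchronous face motion while the identity stream stays stable over time, which is what disentangles the two factors without labels.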
Naturally Occurring Affect Predicts Verbal and Spatial Working Memory Performance
Some research has shown that induced affective states that vary in valence have
differential effects on verbal and spatial working memory performance, such that positive affect
improves verbal working memory and impairs spatial working memory, while negative affect
improves spatial working memory and impairs verbal working memory. However, other research
using similar mood induction and working memory tasks has supported a nonspecific influence
of affect on working memory performance where fear impairs, and positive affect improves, both
verbal and spatial working memory. The present study investigated whether individual
differences in naturally occurring trait and state affect could predict verbal and spatial working
memory performance across six working memory tasks. Valence uniquely predicted working
memory performance over and above arousal and the interaction of valence and arousal which
were not significant predictors. Positive affect was associated with better working memory performance,
while negative affect was associated with worse working memory performance. This pattern held
across both verbal and spatial working memory tasks, but was observed more strongly with
2-back working memory tasks than with complex span working memory tasks. These findings
suggest that, in contrast to research demonstrating differential effects of affective states on verbal
and spatial working memory performance, naturally occurring affect demonstrates a
modality-independent effect on working memory performance.
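The analysis described above, valence predicting performance over and above arousal and their interaction, amounts to a multiple regression with an interaction term. The sketch below fits such a model by least squares on simulated data; the data and effect sizes are invented to mirror the reported pattern, not the study's actual results.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
valence = rng.normal(size=n)
arousal = rng.normal(size=n)
# Simulated scores in which only valence matters.
wm_score = 0.5 * valence + rng.normal(scale=0.3, size=n)

# Design matrix: intercept, valence, arousal, valence x arousal interaction.
X = np.column_stack([np.ones(n), valence, arousal, valence * arousal])
beta, *_ = np.linalg.lstsq(X, wm_score, rcond=None)
print(np.round(beta, 2))
```

With this design, a reliably nonzero valence coefficient alongside near-zero arousal and interaction coefficients is the statistical signature of the "valence uniquely predicts" finding.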