Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas. Comment: To appear in ECCV 201
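
The curriculum idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `tau` hardness parameter, the use of Euclidean distance, and the contrastive loss form are all assumptions made for illustration. The sketch ranks candidate negatives by distance to an anchor embedding and lets `tau` move the choice from the easiest negative (farthest) towards the hardest (closest) as training progresses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_negative(anchor, negatives, tau):
    """Pick one negative for `anchor` with hardness controlled by `tau`.

    Hypothetical schedule parameter: tau = 0.0 returns the easiest
    negative (farthest from the anchor), tau = 1.0 returns the hardest
    (closest). A curriculum would raise tau gradually during training.
    """
    if not negatives:
        raise ValueError("need at least one negative")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    # Sort from easiest (largest distance) to hardest (smallest distance).
    ranked = sorted(negatives, key=lambda n: euclidean(anchor, n), reverse=True)
    index = int(round(tau * (len(ranked) - 1)))
    return ranked[index]


def contrastive_loss(anchor, other, is_positive, margin=1.0):
    """Standard contrastive loss on a single pair (assumed loss form).

    Positive pairs are pulled together; negative pairs are pushed apart
    until they are at least `margin` away.
    """
    d = euclidean(anchor, other)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    face = [0.0, 0.0]                      # toy anchor embedding
    voices = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # toy negative embeddings
    for tau in (0.0, 0.5, 1.0):
        print(tau, select_negative(face, voices, tau))
```

In this toy setting, early training (small `tau`) sees negatives that are already far away and contribute little loss, while later training (large `tau`) concentrates on negatives that are close to the anchor.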
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin. Comment: To appear in Interspeech 2018. The audio-visual dataset can be
downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 .
1806.05622v2: minor fixes; 5 pages
Adhesion of veneering porcelain to alumina ceramic as determined by the strain energy release rate
The Impact of Attending Online Mindfulness Drop-In Sessions on Depression, Anxiety, Distress and Wellbeing in the General Population
Objectives: There is a lack of research into online mindfulness drop-in sessions (OMDIS)
that have been offered freely to the public, especially during the Covid-19 pandemic. These
sessions offer more flexibility than standard mindfulness-based interventions that run for a set
number of sessions, as individuals can ‘drop in’ to as many sessions as and when they like.
This research aimed to explore the impact of attending group facilitated OMDIS on
psychological outcomes in the general population.
Methods: A quantitative cross-sectional retrospective design was adopted in this study.
Participants (n=112) were recruited online through OMDIS providers in the UK and
internationally. Attendees were asked to complete an online survey with measures of
depression, anxiety, distress and wellbeing, both for their current state and retrospectively for
their state before attending any OMDIS. They also reported the number, duration and
frequency of sessions attended, as well as their ease and accuracy of retrospective recall.
Results: Paired t-tests and two-way repeated measures ANOVAs were conducted. Findings
indicated that: OMDIS were efficacious in improving depression, anxiety, distress and
wellbeing; attending more sessions, more frequently, for longer durations was not required to
attain these benefits; and being on a psychology waitlist or having prior mindfulness
experience did not lead to greater benefits, whereas having depression prior to attending
OMDIS did lead to greater improvements in psychological outcomes.
Conclusions: The current study is the first to explore and provide evidence for the efficacy of
OMDIS on psychological outcomes. OMDIS are cost-effective and readily available and
therefore could be offered to those on waiting lists for psychological interventions, who often
wait prolonged periods without any support. Further research is needed to understand other
factors that may impact efficacy in order to maximise the utility of OMDIS.
Keywords: Online, mindfulness, drop-in, mental health, depression, anxiety
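
The paired t-test named in the Results above compares each participant's retrospective "before" score with their current score. The following is a minimal sketch of that statistic only; the function name and the example scores are invented for illustration and are not data from the study, and the study's actual analysis software is not stated in the abstract.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired t-test.

    `before` and `after` hold one score per participant, in the same
    order. The statistic is the mean of the per-participant differences
    divided by its standard error.
    """
    if len(before) != len(after):
        raise ValueError("both lists must have one score per participant")
    if len(before) < 2:
        raise ValueError("need at least two participants")
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Made-up depression scores for four hypothetical participants.
    before = [10, 12, 14, 16]
    after = [8, 11, 11, 14]
    t, df = paired_t_statistic(before, after)
    print(f"t = {t:.3f}, df = {df}")
```

A negative t here means scores fell after attendance; converting t to a p-value requires the t distribution with the returned degrees of freedom, which this sketch leaves out.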
Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart--without annotation--the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads `in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance. Comment: ICASSP 2020. The first three authors contributed equally to this work
From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script
The goal of this paper is the automatic identification of characters in TV
and feature film material. In contrast to standard approaches to this task,
which rely on the weak supervision afforded by transcripts and subtitles, we
propose a new method requiring only a cast list. This list is used to obtain
images of actors from freely available sources on the web, providing a form of
partial supervision for this task. In using images of actors to recognize
characters, we make the following three contributions: (i) We demonstrate that
an automated semi-supervised learning approach is able to adapt from the
actor's face to the character's face, including the face context of the hair;
(ii) By building voice models for every character, we provide a bridge between
frontal faces (for which there is plenty of actor-level supervision) and
profile (for which there is very little or none); and (iii) by combining face
context and speaker identification, we are able to identify characters with
partially occluded faces and extreme facial poses. Results are presented on the
TV series 'Sherlock' and the feature film 'Casablanca'. We achieve the
state-of-the-art on the Casablanca benchmark, surpassing previous methods that
have used the stronger supervision available from transcripts.
Seeing Voices and Hearing Faces: Cross-modal biometric matching
We introduce a seemingly impossible task: given only an audio clip of someone
speaking, decide which of two face images is the speaker. In this paper we
study this, and a number of related cross-modal tasks, aimed at answering the
question: how much can we infer from the voice about the face and vice versa?
We study this task "in the wild", employing the datasets that are now publicly
available for face recognition from static images (VGGFace) and speaker
identification from audio (VoxCeleb). These provide training and testing
scenarios for both static and dynamic testing of cross-modal matching. We make
the following contributions: (i) we introduce CNN architectures for both binary
and multi-way cross-modal face and audio matching, (ii) we compare dynamic
testing (where video information is available, but the audio is not from the
same video) with static testing (where only a single still image is available),
and (iii) we use human testing as a baseline to calibrate the difficulty of the
task. We show that a CNN can indeed be trained to solve this task in both the
static and dynamic scenarios, and is even well above chance on 10-way
classification of the face given the voice. The CNN matches human performance
on easy examples (e.g. different gender across faces) but exceeds human
performance on more challenging examples (e.g. faces with the same gender, age
and nationality). Comment: To appear in: IEEE Computer Vision and Pattern Recognition (CVPR),
201
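
The forced-choice task described in the abstract above (given a voice, decide which of N faces is the speaker) can be sketched as a nearest-neighbour decision over embeddings. This is a deliberate simplification and an assumption: the abstract describes CNN architectures trained for the matching task, not a cosine-similarity rule, and the function names and toy vectors below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_voice_to_face(voice, faces):
    """Return the index of the face embedding most similar to `voice`.

    With two faces this is the binary task; with ten faces it is the
    10-way task mentioned in the abstract.
    """
    if not faces:
        raise ValueError("need at least one candidate face")
    scores = [cosine_similarity(voice, face) for face in faces]
    return max(range(len(faces)), key=scores.__getitem__)


def chance_accuracy(num_candidates):
    """Expected accuracy of random guessing in an N-way forced choice."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return 1.0 / num_candidates


if __name__ == "__main__":
    voice = [1.0, 0.0]                   # toy voice embedding
    faces = [[0.0, 1.0], [0.9, 0.1]]     # toy face embeddings
    print(match_voice_to_face(voice, faces))
    print(chance_accuracy(2), chance_accuracy(10))
```

The `chance_accuracy` helper makes the baseline explicit: random guessing gives 50% on the binary task and 10% on the 10-way task, which is the reference point for "well above chance".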