Domain-Specific Face Synthesis for Video Face Recognition from a Single Sample Per Person
The performance of still-to-video face recognition (FR) systems can decline
significantly because faces captured in an unconstrained operational domain (OD) over multiple
video cameras have a different underlying data distribution compared to faces
captured under controlled conditions in the enrollment domain (ED) with a still
camera. This is particularly true when individuals are enrolled to the system
using a single reference still. To improve the robustness of these systems, it
is possible to augment the reference set by generating synthetic faces based on
the original still. However, without knowledge of the OD, many synthetic images
must be generated to account for all possible capture conditions. FR systems
may therefore require complex implementations and yield lower accuracy when
trained on many less-relevant images. This paper introduces an algorithm for
domain-specific face synthesis (DSFS) that exploits the representative
intra-class variation information available from the OD. Prior to operation, a
compact set of faces from unknown persons appearing in the OD is selected
through clustering in the capture condition space. The domain-specific
variations of these face images are projected onto the reference stills by
integrating an image-based face relighting technique inside the 3D
reconstruction framework. A compact set of synthetic faces is generated that
resemble individuals of interest under the capture conditions relevant to the
OD. In a particular implementation based on sparse representation
classification, the synthetic faces generated with the DSFS are employed to
form a cross-domain dictionary that accounts for structured sparsity.
Experimental results reveal that augmenting the reference gallery set of FR
systems using the proposed DSFS approach can provide higher accuracy than
state-of-the-art approaches, with only a moderate increase in computational
complexity.
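The following is a minimal sketch of the two steps the abstract describes: (1) selecting a compact set of OD capture conditions by clustering, and (2) classifying a probe face by sparse representation over a cross-domain dictionary of synthetic faces. The condition features, dictionary layout, and the use of k-means and Lasso here are illustrative assumptions, not the paper's exact method.

```python
# Sketch of DSFS-style condition selection + sparse-representation
# classification; all data below is synthetic for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# --- Step 1: pick representative OD capture conditions by clustering -------
# Assume each OD face is summarized by a condition vector (e.g., estimated
# pose angles and illumination statistics): 500 faces, 4 condition features.
od_conditions = rng.normal(size=(500, 4))
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(od_conditions)
# The OD faces closest to each cluster center form the compact set.
compact_idx = [
    int(np.argmin(np.linalg.norm(od_conditions - c, axis=1)))
    for c in kmeans.cluster_centers_
]

# --- Step 2: sparse classification over a cross-domain dictionary ----------
# Columns are synthetic faces: n_conditions variants per enrolled subject.
n_subjects, n_conditions, dim = 10, 5, 128
dictionary = rng.normal(size=(dim, n_subjects * n_conditions))
dictionary /= np.linalg.norm(dictionary, axis=0)

# A probe that is (roughly) a sparse combination of subject 3's variants.
true_cols = slice(3 * n_conditions, 4 * n_conditions)
probe = dictionary[:, true_cols] @ rng.uniform(0.5, 1.0, n_conditions)

coder = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
coder.fit(dictionary, probe)
codes = coder.coef_.reshape(n_subjects, n_conditions)
# Assign the probe to the subject whose variants carry most coefficient energy;
# grouping coefficients per subject mimics the structured-sparsity idea.
print("predicted subject:", int(np.argmax(np.abs(codes).sum(axis=1))))
```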
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Talking face generation aims to synthesize a sequence of face images that
correspond to a clip of speech. This is a challenging task because face
appearance variation and semantics of speech are coupled together in the subtle
movements of the talking face regions. Existing works either construct
subject-specific face appearance models or model the transformation between
lip motion and speech. In this work, we integrate both aspects and enable
arbitrary-subject talking face generation by learning disentangled audio-visual
representation. We find that the talking face sequence is actually a
composition of both subject-related information and speech-related information.
These two spaces are then explicitly disentangled through a novel
associative-and-adversarial training process. This disentangled representation
has the advantage that either audio or video can serve as the input for generation.
Extensive experiments show that the proposed approach generates realistic
talking face sequences on arbitrary subjects with much clearer lip motion
patterns than previous work. We also demonstrate that the learned audio-visual
representation is useful for automatic lip reading and audio-video retrieval.
Comment: AAAI Conference on Artificial Intelligence (AAAI 2019), oral
presentation. Code, models, and video results are available on our webpage:
https://liuziwei7.github.io/projects/TalkingFace.htm
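Below is a minimal PyTorch sketch of the disentanglement idea the abstract describes: a frame is encoded into a subject (identity) space and a speech-content space, and an adversarial classifier is trained so that identity cannot be recovered from the speech-content code. The network sizes, the gradient-reversal formulation, and names like DisentangledEncoder are illustrative assumptions, not the authors' architecture; the full training objective would also include generation and association losses omitted here.

```python
# Sketch of adversarial disentanglement of subject vs. speech information.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

class DisentangledEncoder(nn.Module):
    def __init__(self, frame_dim=1024, subj_dim=64, speech_dim=64, n_subjects=100):
        super().__init__()
        self.subject_enc = nn.Linear(frame_dim, subj_dim)   # who is talking
        self.speech_enc = nn.Linear(frame_dim, speech_dim)  # what is being said
        # Adversary: tries to predict the subject from the speech-content code.
        self.adversary = nn.Linear(speech_dim, n_subjects)

    def forward(self, frame):
        subj = self.subject_enc(frame)
        speech = self.speech_enc(frame)
        # Reversed gradients push speech_enc to discard identity cues
        # while the adversary itself still learns to exploit them.
        adv_logits = self.adversary(GradReverse.apply(speech))
        return subj, speech, adv_logits

# Toy usage: one adversarial step on random data.
model = DisentangledEncoder()
frames = torch.randn(8, 1024)              # stand-in for frame features
subject_ids = torch.randint(0, 100, (8,))  # stand-in identity labels
subj, speech, adv_logits = model(frames)
adv_loss = nn.functional.cross_entropy(adv_logits, subject_ids)
adv_loss.backward()  # adversary improves; speech encoder is pushed the other way
```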