Instance-level Facial Attributes Transfer with Geometry-Aware Flow
We address the problem of instance-level facial attribute transfer without
paired training data, e.g. faithfully transferring the exact mustache from a
source face to a target face. This is a more challenging task than the
conventional semantic-level attribute transfer, which only preserves the
generic attribute style instead of instance-level traits. We propose the use of
geometry-aware flow, which serves as a well-suited representation for modeling
the transformation between instance-level facial attributes. Specifically, we
leverage the facial landmarks as the geometric guidance to learn the
differentiable flows automatically, despite the large pose gap.
Geometry-aware flow is able to warp the source face attribute into the target
face context and generate a warp-and-blend result. To compensate for the
potential appearance gap between source and target faces, we propose a
hallucination sub-network that produces an appearance residual to further
refine the warp-and-blend result. Finally, a cycle-consistency framework
consisting of both an attribute transfer module and an attribute removal module is
designed, so that abundant unpaired face images can be used as training data.
Extensive evaluations validate the capability of our approach in transferring
instance-level facial attributes faithfully across large pose and appearance
gaps. Thanks to the flow representation, our approach can readily be applied to
generate realistic details on high-resolution images.
Comment: To appear in AAAI 2019. Code and models are available at:
https://github.com/wdyin/GeoGA
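The warp-and-blend step described in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: in the paper the flow is learned from facial landmarks and the residual comes from a hallucination sub-network, whereas here `flow`, `mask`, and `residual` are simply given arrays. The soft blending `mask` and the function names are assumptions made for illustration.

```python
import numpy as np

def warp_with_flow(src, flow):
    """Backward-warp src (H, W, C) with a dense flow (H, W, 2).

    out[y, x] is sampled bilinearly from src at (y + flow_y, x + flow_x);
    flow[..., 0] is the x offset and flow[..., 1] the y offset (assumed layout).
    """
    H, W = src.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sampling positions, clamped to the image border.
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]
    # Bilinear interpolation between the four neighbouring pixels.
    top = (1 - wx) * src[y0, x0] + wx * src[y0, x1]
    bottom = (1 - wx) * src[y1, x0] + wx * src[y1, x1]
    return (1 - wy) * top + wy * bottom

def warp_and_blend(src, tgt, flow, mask, residual=None):
    """Blend the warped source into the target; optionally add a residual."""
    warped = warp_with_flow(src, flow)
    m = mask[..., None]  # (H, W) soft mask in [0, 1], hypothetical input
    out = m * warped + (1 - m) * tgt
    if residual is not None:
        out = out + residual  # stands in for the appearance refinement
    return out

rng = np.random.default_rng(0)
src = rng.random((4, 5, 3))
tgt = rng.random((4, 5, 3))
zero_flow = np.zeros((4, 5, 2))
shift_flow = np.zeros((4, 5, 2))
shift_flow[..., 0] = 1.0  # sample one pixel to the right
blended = warp_and_blend(src, tgt, zero_flow, np.full((4, 5), 0.5))
```

Because the flow is applied by resampling rather than by generating pixels, the same flow field can in principle be upsampled and applied to a higher-resolution image, which is consistent with the abstract's remark about high-resolution results.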
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Talking face generation aims to synthesize a sequence of face images that
correspond to a clip of speech. This is a challenging task because face
appearance variation and semantics of speech are coupled together in the subtle
movements of the talking face regions. Existing works either construct a
subject-specific face appearance model or model the transformation between
lip motion and speech. In this work, we integrate both aspects and enable
arbitrary-subject talking face generation by learning disentangled audio-visual
representation. We find that the talking face sequence is actually a
composition of both subject-related information and speech-related information.
These two spaces are then explicitly disentangled through a novel
associative-and-adversarial training process. This disentangled representation
has the advantage that both audio and video can serve as inputs for generation.
Extensive experiments show that the proposed approach generates realistic
talking face sequences on arbitrary subjects with much clearer lip motion
patterns than previous work. We also demonstrate that the learned audio-visual
representation is extremely useful for the tasks of automatic lip reading and
audio-video retrieval.
Comment: AAAI Conference on Artificial Intelligence (AAAI 2019) Oral
Presentation. Code, models, and video results are available on our webpage:
https://liuziwei7.github.io/projects/TalkingFace.htm
Deep Learning Face Attributes in the Wild
Predicting face attributes in the wild is challenging due to complex face
variations. We propose a novel deep learning framework for attribute prediction
in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly
with attribute tags, but pre-trained differently. LNet is pre-trained by
massive general object categories for face localization, while ANet is
pre-trained by massive face identities for attribute prediction. This framework
not only outperforms the state-of-the-art by a large margin, but also reveals
valuable facts on learning face representation.
(1) It shows how the performance of face localization (LNet) and attribute
prediction (ANet) can be improved by different pre-training strategies.
(2) It reveals that although the filters of LNet are fine-tuned only with
image-level attribute tags, their response maps over entire images strongly
indicate face locations. This fact enables training LNet for face
localization with only image-level annotations, but without face bounding boxes
or landmarks, which are required by all attribute recognition works.
(3) It also demonstrates that the high-level hidden neurons of ANet
automatically discover semantic concepts after pre-training with massive face
identities, and such concepts are significantly enriched after fine-tuning with
attribute tags. Each attribute can be well explained with a sparse linear
combination of these concepts.
Comment: To appear in International Conference on Computer Vision (ICCV) 201
GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
We present GauHuman, a 3D human model with Gaussian Splatting for both fast
training (1 ~ 2 minutes) and real-time rendering (up to 189 FPS), compared with
existing NeRF-based implicit representation modelling frameworks demanding
hours of training and seconds of rendering per frame. Specifically, GauHuman
encodes Gaussian Splatting in the canonical space and transforms 3D Gaussians
from canonical space to posed space with linear blend skinning (LBS), in which
effective pose and LBS refinement modules are designed to learn fine details of
3D humans under negligible computational cost. Moreover, to enable fast
optimization of GauHuman, we initialize and prune 3D Gaussians with 3D human
prior, while splitting/cloning via KL divergence guidance, along with a novel
merge operation for further speeding up. Extensive experiments on ZJU_Mocap and
MonoCap datasets demonstrate that GauHuman achieves state-of-the-art
performance quantitatively and qualitatively with fast training and real-time
rendering speed. Notably, without sacrificing rendering quality, GauHuman can
quickly model the 3D human performer with ~13k 3D Gaussians.
Comment: project page: https://skhu101.github.io/GauHuman/; code:
https://github.com/skhu101/GauHuma
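The canonical-to-posed transformation mentioned in the GauHuman abstract can be sketched with generic linear blend skinning (LBS). This is a textbook-style illustration, not the GauHuman code: the function name, the array layouts, and the choice to push covariances through the blended linear part are assumptions, and the paper's pose and LBS refinement modules are omitted.

```python
import numpy as np

def lbs_transform_gaussians(means, covs, weights, bone_transforms):
    """Move 3D Gaussians from canonical space to posed space with LBS.

    means:           (N, 3)    canonical Gaussian centres
    covs:            (N, 3, 3) canonical covariances
    weights:         (N, B)    skinning weights, each row summing to 1
    bone_transforms: (B, 4, 4) rigid transform of each bone
    """
    # Per-Gaussian blended transform: weighted sum of the bone matrices.
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)
    A = blended[:, :3, :3]  # blended linear part
    t = blended[:, :3, 3]   # blended translation
    posed_means = np.einsum("nij,nj->ni", A, means) + t
    # A Gaussian under an affine map x -> A x + t has covariance A S A^T.
    posed_covs = A @ covs @ np.transpose(A, (0, 2, 1))
    return posed_means, posed_covs

# Two hypothetical bones: identity, and a translation by one unit along x.
identity = np.eye(4)
shift_x = np.eye(4)
shift_x[0, 3] = 1.0
bones = np.stack([identity, shift_x])

means = np.zeros((3, 3))
covs = np.tile(np.eye(3), (3, 1, 1))
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

posed_means, posed_covs = lbs_transform_gaussians(means, covs, weights, bones)
```

In this toy setup the third Gaussian is bound equally to both bones, so it ends up halfway between the two bone positions, which is the characteristic blending behaviour of LBS.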
Relighting4D: Neural Relightable Human from Videos
Human relighting is a highly desirable yet challenging task. Existing works
either require expensive one-light-at-a-time (OLAT) captured data using a light
stage or cannot freely change the viewpoints of the rendered body. In this
work, we propose a principled framework, Relighting4D, that enables
free-viewpoint relighting from only human videos under unknown illumination.
Our key insight is that the space-time varying geometry and reflectance of the
human body can be decomposed as a set of neural fields of normal, occlusion,
diffuse, and specular maps. These neural fields are further integrated into
reflectance-aware physically based rendering, where each vertex in the neural
field absorbs and reflects the light from the environment. The whole framework
can be learned from videos in a self-supervised manner, with physically
informed priors designed for regularization. Extensive experiments on both real
and synthetic datasets demonstrate that our framework is capable of relighting
dynamic human actors with free viewpoints.
Comment: ECCV 2022; Project Page:
https://frozenburning.github.io/projects/relighting4d Code is available at
https://github.com/FrozenBurning/Relighting4
Distributed Estimation and Inference with Statistical Guarantees
This paper studies hypothesis testing and parameter estimation in the context
of the divide and conquer algorithm. In a unified likelihood based framework,
we propose new test statistics and point estimators obtained by aggregating
various statistics from k subsamples of size n/k, where n is the sample
size. In both low dimensional and high dimensional settings, we address the
important question of how to choose k as n grows large, providing a
theoretical upper bound on k such that the information loss due to the divide
and conquer algorithm is negligible. In other words, the resulting estimators
have the same inferential efficiencies and estimation rates as a practically
infeasible oracle with access to the full sample. Thorough numerical results
are provided to back up the theory.
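As a rough illustration of the divide and conquer idea in the abstract above, the sketch below splits n observations into k disjoint subsamples of size about n/k, fits ordinary least squares on each, and averages the estimates. Simple averaging of OLS estimates is one assumed instance of "aggregating statistics from subsamples"; the paper's own test statistics and high dimensional estimators are not reproduced here, and the function name and simulated data are hypothetical.

```python
import numpy as np

def divide_and_conquer_ols(X, y, k):
    """Average the OLS estimates computed on k disjoint subsamples."""
    n = X.shape[0]
    # Split the row indices into k nearly equal blocks of size about n/k.
    blocks = np.array_split(np.arange(n), k)
    estimates = []
    for idx in blocks:
        beta_j, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        estimates.append(beta_j)
    return np.mean(estimates, axis=0)

# Simulated linear model (illustrative data only).
rng = np.random.default_rng(0)
n, d, k = 10_000, 5, 10
beta_true = np.arange(1, d + 1, dtype=float)
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(size=n)

beta_dc = divide_and_conquer_ols(X, y, k)
# Full-sample ("oracle") estimate for comparison.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(beta_dc - beta_full)))
```

With k small relative to n, the averaged estimate stays close to the full-sample one; the abstract's upper bound on k concerns how fast k may grow before this stops being true.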