POCE: Pose-Controllable Expression Editing
Facial expression editing has attracted increasing attention with the advance
of deep neural networks in recent years. However, most existing methods suffer
from compromised editing fidelity and limited usability as they either ignore
pose variations (unrealistic editing) or require paired training data (not easy
to collect) for pose control. This paper presents POCE, an innovative
pose-controllable expression editing network that can generate realistic facial
expressions and head poses simultaneously with just unpaired training images.
POCE achieves more accessible and realistic pose-controllable expression
editing by mapping face images into UV space, where facial expressions and
head poses can be disentangled and edited separately. POCE has two novel
designs. The first is self-supervised UV completion, which completes UV maps
sampled under different head poses, where self-occlusion often leaves facial
texture missing. The second is weakly-supervised UV editing, which generates
new facial expressions with minimal modification of facial identity; the
synthesized expression can be controlled by an expression label or
transplanted directly from a reference UV map via feature transfer. Extensive
experiments show that POCE can learn from unpaired face images effectively,
and the learned model can generate realistic and high-fidelity facial
expressions under various new poses.
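To make the self-supervised UV completion step concrete, below is a minimal
PyTorch sketch of the idea as we read it: a completion network is trained to
inpaint UV texture maps whose visible region is randomly masked, mimicking
pose-induced self-occlusion. The module name, the tiny network, and the
rectangular masking scheme are illustrative assumptions, not the authors' code.

# Minimal sketch of self-supervised UV completion (hypothetical names,
# not the authors' implementation).
import torch
import torch.nn as nn

class UVCompletionNet(nn.Module):
    """Tiny encoder-decoder stand-in for the UV completion network."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )
    def forward(self, uv, mask):
        # Concatenate the visibility mask so the network knows which
        # texels are observed and which must be hallucinated.
        return self.net(torch.cat([uv * mask, mask], dim=1))

def self_supervised_step(net, complete_uv):
    # Sample a random pseudo-occlusion mask (assumption: rectangular
    # dropout as a stand-in for pose-induced self-occlusion).
    b, _, h, w = complete_uv.shape
    mask = torch.ones(b, 1, h, w)
    x0, y0 = torch.randint(0, w // 2, (2,))
    mask[:, :, y0:y0 + h // 2, x0:x0 + w // 2] = 0.0
    pred = net(complete_uv, mask)
    # Supervise everywhere: visible texels must be kept, occluded
    # texels must be recovered from the complete map itself.
    return nn.functional.l1_loss(pred, complete_uv)

net = UVCompletionNet()
loss = self_supervised_step(net, torch.rand(2, 3, 64, 64))
loss.backward()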
Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations
Audio-driven talking face generation, which aims to synthesize talking faces
with realistic facial animations (including accurate lip movements, vivid
facial expression details and natural head poses) corresponding to the audio,
has achieved rapid progress in recent years. However, most existing work
focuses only on generating lip movements, without handling the closely
correlated facial expressions, which greatly degrades the realism of the
generated faces. This paper presents DIRFA, a novel method that can generate
talking faces with diverse yet realistic facial animations from the same
driving audio. To accommodate the natural variation of plausible facial
animations for the same audio, we design a transformer-based probabilistic
mapping network that can model the variational facial animation distribution
conditioned on the input audio and autoregressively convert the audio signals
into a facial animation sequence. In addition, we introduce a
temporally-biased mask into the mapping network, which allows the network to
model the temporal dependency of facial animations and produce temporally
smooth facial animation sequences. With the
generated facial animation sequence and a source image, photo-realistic talking
faces can be synthesized with a generic generation network. Extensive
experiments show that DIRFA can generate talking faces with realistic facial
animations effectively.
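The temporally-biased mask admits a compact illustration. The sketch below
(our reading of the abstract, not DIRFA's exact formulation) builds an
additive causal attention mask whose bias decays with temporal distance, so
each frame attends mostly to its recent past:

# Hedged sketch of a temporally-biased causal attention mask.
import torch

def temporally_biased_mask(T, decay=0.1):
    i = torch.arange(T).unsqueeze(1)   # query (current) frame index
    j = torch.arange(T).unsqueeze(0)   # key (attended) frame index
    bias = -decay * (i - j).clamp(min=0).float()  # favor the recent past
    bias.masked_fill_(j > i, float("-inf"))       # causal: no future frames
    return bias  # add to attention logits before the softmax

mask = temporally_biased_mask(6)
attn = torch.softmax(torch.zeros(6, 6) + mask, dim=-1)
print(attn[3])  # frame 3 attends most to frames 2-3, least to frame 0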
Auto-regressive Image Synthesis with Integrated Quantization
Deep generative models have achieved remarkable progress in realistic image
synthesis with various conditional inputs, while generating diverse yet
high-fidelity images remains a grand challenge in conditional image generation.
This paper presents a versatile framework for conditional image generation
which incorporates the inductive bias of CNNs and powerful sequence modeling of
auto-regression that naturally leads to diverse image generation. Instead of
independently quantizing the features of multiple domains as in prior research,
we design an integrated quantization scheme with a variational regularizer that
mingles the feature discretization in multiple domains, and markedly boosts the
auto-regressive modeling performance. Notably, the variational regularizer
makes it possible to regularize feature distributions in incomparable latent
spaces by penalizing the intra-domain variations of distributions. In
addition, we design a Gumbel sampling strategy that allows incorporating
distribution uncertainty into the auto-regressive training procedure. The
Gumbel sampling substantially mitigates the exposure bias that often causes
misalignment between the training and inference stages and severely impairs
inference performance. Extensive
experiments over multiple conditional image generation tasks show that our
method achieves superior diverse image generation performance qualitatively and
quantitatively as compared with the state-of-the-art.
Comment: Accepted to ECCV 2022 as Oral Presentation.
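The Gumbel sampling strategy can be illustrated with a toy vector-quantization
step. In the hedged sketch below, codes are drawn from a Gumbel-softmax over
negative code distances instead of a hard nearest-neighbor lookup, exposing
training to assignment uncertainty; the shapes and the cdist-based lookup are
illustrative assumptions, not the paper's implementation:

# Minimal sketch of Gumbel sampling over a VQ codebook.
import torch
import torch.nn.functional as F

def gumbel_quantize(z, codebook, tau=1.0):
    # z: (N, D) encoder features; codebook: (K, D) code vectors.
    dists = torch.cdist(z, codebook)          # (N, K) L2 distances
    logits = -dists                           # closer codes = higher logit
    onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # stochastic pick
    return onehot @ codebook                  # quantized features, (N, D)

z = torch.randn(8, 16)
codebook = torch.randn(32, 16)
zq = gumbel_quantize(z, codebook)
print(zq.shape)  # torch.Size([8, 16])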
EMLight: Lighting Estimation via Spherical Distribution Approximation
Illumination estimation from a single image is critical in 3D rendering and
it has been investigated extensively in the computer vision and computer
graphics research communities. However, existing works estimate illumination
by either regressing light parameters or generating illumination maps, both of
which are often hard to optimize or tend to produce inaccurate predictions.
We propose Earth Mover Light (EMLight), an illumination estimation framework
that leverages a regression network and a neural projector for accurate
illumination estimation. We decompose the illumination map into spherical light
distribution, light intensity and the ambient term, and define the illumination
estimation as a parameter regression task for the three illumination
components. Motivated by the Earth Mover distance, we design a novel spherical
mover's loss that guides the network to regress light distribution parameters
accurately by exploiting the subtleties of the spherical distribution. Under the
guidance of the predicted spherical distribution, light intensity and ambient
term, the neural projector synthesizes panoramic illumination maps with
realistic light frequency. Extensive experiments show that EMLight achieves
accurate illumination estimation and the generated relighting in 3D object
embedding exhibits superior plausibility and fidelity as compared with
state-of-the-art methods.
Comment: Accepted to AAAI 2021.
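A spherical mover's loss of this kind can be approximated with
entropy-regularized optimal transport. The sketch below computes a Sinkhorn
approximation of the earth mover distance between a predicted and a target
light distribution over anchor points on the unit sphere; the random anchors
and the Sinkhorn solver are our stand-ins, not necessarily the paper's exact
loss:

# Hedged sketch of a spherical mover's loss via Sinkhorn iterations.
import torch

def sinkhorn_emd(p, q, cost, eps=0.05, iters=50):
    # p, q: (N,) distributions; cost: (N, N) geodesic distances.
    K = torch.exp(-cost / eps)
    u = torch.ones_like(p)
    for _ in range(iters):                 # Sinkhorn fixed-point updates
        v = q / (K.t() @ u)
        u = p / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)
    return (plan * cost).sum()             # approximate transport cost

n = 64
anchors = torch.nn.functional.normalize(torch.randn(n, 3), dim=1)
cost = torch.arccos((anchors @ anchors.t()).clamp(-1, 1))  # geodesic distance
p = torch.softmax(torch.randn(n), dim=0)   # predicted light distribution
q = torch.softmax(torch.randn(n), dim=0)   # ground-truth distribution
print(sinkhorn_emd(p, q, cost))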
WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields
Neural Radiance Field (NeRF) has shown impressive performance in novel view
synthesis via implicit scene representation. However, it usually suffers from
poor scalability, as it requires densely sampled images for each new scene.
Several studies have attempted to mitigate this problem by integrating the
Multi-View Stereo (MVS) technique into NeRF, but they still entail a
cumbersome fine-tuning process for new scenes. Notably, the rendering quality
drops severely without this fine-tuning, and the errors mainly appear around
high-frequency features. In light of this observation, we
design WaveNeRF, which integrates wavelet frequency decomposition into MVS and
NeRF to achieve generalizable yet high-quality synthesis without any per-scene
optimization. To preserve high-frequency information when generating 3D feature
volumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating
the discrete wavelet transform into the classical cascade MVS, which
disentangles high-frequency information explicitly. With that, disentangled
frequency features can be injected into classic NeRF via a novel hybrid neural
renderer to yield faithful high-frequency details, and an intuitive
frequency-guided sampling strategy can be designed to suppress artifacts around
high-frequency regions. Extensive experiments over three widely studied
benchmarks show that WaveNeRF achieves superior generalizable radiance field
modeling when given only three images as input.
Comment: Accepted to ICCV 2023. Project website: https://mxuai.github.io/WaveNeRF
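The frequency disentanglement at the heart of WaveNeRF rests on the discrete
wavelet transform. Below is a minimal one-level 2D Haar DWT, implemented with
fixed strided convolutions, that splits a feature map into a low-frequency
band (LL) and three high-frequency bands (LH, HL, HH); it illustrates the
decomposition only, not the paper's cascade MVS pipeline:

# Minimal one-level 2D Haar wavelet decomposition.
import torch
import torch.nn.functional as F

def haar_dwt2d(x):
    # x: (B, C, H, W) with even H, W.
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
    b, c, h, w = x.shape
    # Apply all four filters to every channel independently.
    out = F.conv2d(x.reshape(b * c, 1, h, w), kernels, stride=2)
    return out.reshape(b, c, 4, h // 2, w // 2).unbind(dim=2)

x = torch.randn(1, 8, 32, 32)
ll, lh, hl, hh = haar_dwt2d(x)
print(ll.shape)  # torch.Size([1, 8, 16, 16])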
Pose-Free Neural Radiance Fields via Implicit Pose Regularization
Pose-free neural radiance fields (NeRF) aim to train NeRF with unposed
multi-view images, and this line of work has achieved impressive success in
recent years. Most existing works share the same pipeline: first training a
coarse pose estimator with rendered images, followed by a joint optimization
of the estimated poses and the neural radiance field. However, as the pose
estimator is trained with rendered images only, its pose estimates are usually
biased or inaccurate for real images due to the domain gap between real and
rendered images, leading to poor robustness of pose estimation on real images
and, further, to local minima in the joint optimization. We design IR-NeRF, an
innovative pose-free NeRF that introduces implicit pose regularization to
refine the pose estimator with unposed real images and improve the robustness
of pose estimation for real
images. With a collection of 2D images of a specific scene, IR-NeRF constructs
a scene codebook that stores scene features and captures the scene-specific
pose distribution implicitly as priors. Thus, the robustness of pose
estimation can be improved by the scene priors, following the rationale that a
2D real image can be reconstructed well from the scene codebook only when its
estimated pose lies within the pose distribution. Extensive experiments show
that IR-NeRF
achieves superior novel view synthesis and outperforms the state-of-the-art
consistently across multiple synthetic and real datasets.
Comment: Accepted by ICCV 2023.
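The codebook-based prior admits a simple illustration. In the hedged sketch
below, a scene codebook stores feature prototypes, a real image's features are
reconstructed as attention-weighted sums of codebook entries, and the
reconstruction error serves as the pose-plausibility prior; names, shapes, and
the attention scheme are illustrative assumptions, not the authors' code:

# Hedged sketch of implicit pose regularization via a scene codebook.
import torch
import torch.nn.functional as F

class SceneCodebook(torch.nn.Module):
    def __init__(self, n_codes=256, dim=64):
        super().__init__()
        self.codes = torch.nn.Parameter(torch.randn(n_codes, dim))

    def reconstruct(self, feat):
        # feat: (N, dim) image features. Soft attention over codes.
        attn = F.softmax(feat @ self.codes.t(), dim=-1)   # (N, n_codes)
        return attn @ self.codes                          # (N, dim)

codebook = SceneCodebook()
feat = torch.randn(16, 64)            # features of an unposed real image
recon = codebook.reconstruct(feat)
prior_loss = F.mse_loss(recon, feat)  # low only for pose-consistent images
prior_loss.backward()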
GMLight: Lighting Estimation via Geometric Distribution Approximation
Lighting estimation from a single image is an essential yet challenging task
in computer vision and computer graphics. Existing works estimate lighting by
regressing representative illumination parameters or generating illumination
maps directly. However, these methods often suffer from poor accuracy and
generalization. This paper presents Geometric Mover's Light (GMLight), a
lighting estimation framework that employs a regression network and a
generative projector for effective illumination estimation. We parameterize
illumination scenes in terms of the geometric light distribution, light
intensity, ambient term, and auxiliary depth, and estimate them as a pure
regression task. Inspired by the earth mover's distance, we design a novel
geometric mover's loss to guide the accurate regression of light distribution
parameters. With the estimated lighting parameters, the generative projector
synthesizes panoramic illumination maps with realistic appearance and
frequency. Extensive experiments show that GMLight achieves accurate
illumination estimation and superior fidelity in relighting for 3D object
insertion.
Comment: 12 pages, 11 figures. arXiv admin note: text overlap with
arXiv:2012.1111
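To regress a geometric light distribution, the illumination map must first be
discretized over fixed spherical directions. The sketch below generates
near-uniform anchor directions with a Fibonacci lattice and pairs them with a
regressed distribution, intensity, and ambient term; the Fibonacci layout is
our assumption, not an asserted detail of GMLight:

# Minimal sketch: spherical anchors for a regressed light distribution.
import torch

def fibonacci_sphere(n):
    # Near-uniform points on the unit sphere via golden-angle steps.
    k = torch.arange(n, dtype=torch.float32)
    phi = torch.pi * (3.0 - 5.0 ** 0.5) * k
    y = 1.0 - 2.0 * (k + 0.5) / n                 # height in [-1, 1]
    r = torch.sqrt(1.0 - y * y)
    return torch.stack([r * torch.cos(phi), y, r * torch.sin(phi)], dim=1)

anchors = fibonacci_sphere(128)                   # (128, 3) unit directions
weights = torch.softmax(torch.randn(128), dim=0)  # regressed distribution
intensity, ambient = torch.rand(1), torch.rand(3) # remaining light parameters
print(anchors.shape, weights.sum())               # (128, 3), ~1.0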
Weakly Supervised 3D Open-vocabulary Segmentation
Open-vocabulary segmentation of 3D scenes is a fundamental function of human
perception and thus a crucial objective in computer vision research. However,
this task is heavily impeded by the lack of large-scale and diverse 3D
open-vocabulary segmentation datasets for training robust and generalizable
models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation
models helps, but it compromises the open-vocabulary capability, as the 2D
models are mostly fine-tuned on close-vocabulary datasets. We tackle the
challenges in 3D open-vocabulary segmentation by exploiting the pre-trained
foundation models
CLIP and DINO in a weakly supervised manner. Specifically, given only the
open-vocabulary text descriptions of the objects in a scene, we distill the
open-vocabulary multimodal knowledge and object reasoning capability of CLIP
and DINO into a neural radiance field (NeRF), which effectively lifts 2D
features into view-consistent 3D segmentation. A notable aspect of our approach
is that it does not require any manual segmentation annotations for either the
foundation models or the distillation process. Extensive experiments show that
our method even outperforms fully supervised models trained with segmentation
annotations in certain scenes, suggesting that 3D open-vocabulary segmentation
can be effectively learned from 2D images and text-image pairs. Code is
available at https://github.com/Kunhao-Liu/3D-OVS.
Comment: Accepted to NeurIPS 2023.
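The final open-vocabulary labeling step can be sketched compactly: per-ray
features rendered from the distilled feature field are matched to CLIP text
embeddings by cosine similarity, and each ray takes the best-matching class.
The embeddings below are random placeholders standing in for real CLIP
outputs, so the function signature is illustrative only:

# Hedged sketch of open-vocabulary labeling of rendered ray features.
import torch
import torch.nn.functional as F

def segment_rays(ray_feats, text_embeds):
    # ray_feats: (R, D) features rendered from the NeRF feature field.
    # text_embeds: (C, D) CLIP text embeddings of class descriptions.
    sim = F.normalize(ray_feats, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
    return sim.argmax(dim=-1)              # (R,) class index per ray

ray_feats = torch.randn(1024, 512)         # stand-in rendered features
text_embeds = torch.randn(4, 512)          # stand-in embeddings, 4 classes
labels = segment_rays(ray_feats, text_embeds)
print(labels.shape, labels.unique())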