Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis
Photorealistic frontal view synthesis from a single face image has a wide
range of applications in the field of face recognition. Although data-driven
deep learning methods have been proposed to address this problem by seeking
solutions from ample face data, this problem is still challenging because it is
intrinsically ill-posed. This paper proposes a Two-Pathway Generative
Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by
simultaneously perceiving global structures and local details. Four landmark
located patch networks are proposed to attend to local textures in addition to
the commonly used global encoder-decoder network. Beyond the novel
architecture, we make this ill-posed problem well constrained by introducing a
combination of adversarial loss, symmetry loss, and identity-preserving loss.
The combined loss function leverages both frontal face distribution and
pre-trained discriminative deep face models to guide an identity preserving
inference of frontal views from profiles. Different from previous deep learning
methods that mainly rely on intermediate features for recognition, our method
directly leverages the synthesized identity preserving image for downstream
tasks like face recognition and attribute estimation. Experimental results
demonstrate that our method not only presents compelling perceptual results but
also outperforms state-of-the-art results on large-pose face recognition. Comment: accepted at ICCV 2017, main paper & supplementary material, 11 pages
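To make the combined objective concrete, here is a minimal PyTorch sketch of how such a loss might be assembled. This is not the authors' code: the discriminator wiring, the pretrained identity network producing the feature arguments, and all loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def tpgan_style_loss(fake_frontal, real_frontal, d_fake_logits,
                     id_feat_fake, id_feat_real,
                     w_adv=1e-3, w_sym=0.3, w_id=0.02):
    # Pixel-wise reconstruction against the ground-truth frontal view.
    l_pix = F.l1_loss(fake_frontal, real_frontal)
    # Adversarial term: the generator tries to make D label its output real.
    l_adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Symmetry prior: a frontal face should match its horizontal mirror
    # (NCHW layout assumed, so dim 3 is the width axis).
    l_sym = F.l1_loss(fake_frontal, torch.flip(fake_frontal, dims=[3]))
    # Identity preservation: features from a fixed, pretrained face-recognition
    # network should agree between the synthesized and real frontal images.
    l_id = F.mse_loss(id_feat_fake, id_feat_real)
    # Weights are placeholders, not the values used in the paper.
    return l_pix + w_adv * l_adv + w_sym * l_sym + w_id * l_id
```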
Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis
An important problem for both graphics and vision is to synthesize novel
views of a 3D object from a single image. This is particularly challenging due
to the partial observability inherent in projecting a 3D object onto the image
space, and the ill-posedness of inferring object shape and pose. However, we
can train a neural network to address the problem if we restrict our attention
to specific object categories (in our case faces and chairs) for which we can
gather ample training data. In this paper, we propose a novel recurrent
convolutional encoder-decoder network that is trained end-to-end on the task of
rendering rotated objects starting from a single image. The recurrent structure
allows our model to capture long-term dependencies along a sequence of
transformations. We demonstrate the quality of its predictions for human faces
on the Multi-PIE dataset and for a dataset of 3D chair models, and also show
its ability to disentangle latent factors of variation (e.g., identity and
pose) without using full supervision. Comment: published at the NIPS 2015 conference
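A minimal sketch of the recurrent idea, assuming a simple learned latent transformation rather than the paper's exact architecture; the layer sizes, the 64x64 input resolution, and the action encoding are all illustrative.

```python
import torch
import torch.nn as nn

class RecurrentRotator(nn.Module):
    def __init__(self, feat_dim=512, action_dim=3):
        super().__init__()
        # Encoder for a single 3x64x64 input view (sizes are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(128 * 16 * 16, feat_dim))
        # Recurrent latent transformation conditioned on the rotation action.
        self.transform = nn.Linear(feat_dim + action_dim, feat_dim)
        # Decoder back to image space.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 128 * 16 * 16), nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, image, action, steps):
        z = self.encoder(image)      # latent code of the observed view
        frames = []
        for _ in range(steps):       # recurrence along the rotation sequence
            z = torch.relu(self.transform(torch.cat([z, action], dim=1)))
            frames.append(self.decoder(z))
        return frames
```

Applying the same transformation repeatedly, rather than predicting each rotated view independently, is what lets the model capture long-term dependencies along the sequence.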
Deferred Neural Rendering: Image Synthesis using Neural Textures
The modern computer graphics pipeline can synthesize images at remarkable
visual quality; however, it requires well-defined, high-quality 3D content as
input. In this work, we explore the use of imperfect 3D content, for instance,
obtained from photometric reconstructions with noisy and incomplete surface
geometry, while still aiming to produce photo-realistic (re-)renderings. To
address this challenging problem, we introduce Deferred Neural Rendering, a new
paradigm for image synthesis that combines the traditional graphics pipeline
with learnable components. Specifically, we propose Neural Textures, which are
learned feature maps that are trained as part of the scene capture process.
Similar to traditional textures, neural textures are stored as maps on top of
3D mesh proxies; however, the high-dimensional feature maps contain
significantly more information, which can be interpreted by our new deferred
neural rendering pipeline. Both neural textures and deferred neural renderer
are trained end-to-end, enabling us to synthesize photo-realistic images even
when the original 3D content was imperfect. In contrast to traditional,
black-box 2D generative neural networks, our 3D representation gives us
explicit control over the generated output, and allows for a wide range of
application domains. For instance, we can synthesize temporally-consistent
video re-renderings of recorded 3D scenes as our representation is inherently
embedded in 3D space. This way, neural textures can be utilized to coherently
re-render or manipulate existing video content in both static and dynamic
environments at real-time rates. We show the effectiveness of our approach in
several experiments on novel view synthesis, scene editing, and facial
reenactment, and compare to state-of-the-art approaches that leverage the
standard graphics pipeline as well as conventional generative neural networks. Comment: Video: https://youtu.be/z-pVip6WeyY; SIGGRAPH 2019
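The core mechanism can be illustrated with a short sketch: a learnable feature texture is sampled at UV coordinates rasterized from the mesh proxy, then decoded to RGB. This is a simplified stand-in, not the paper's implementation; the texture resolution, feature depth, and decoder are assumptions, and the paper additionally handles view-dependent effects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTextureRenderer(nn.Module):
    def __init__(self, tex_res=512, feat_dim=16):
        super().__init__()
        # The neural texture: an optimizable feature map on the mesh's UV atlas.
        self.texture = nn.Parameter(torch.randn(1, feat_dim, tex_res, tex_res) * 0.01)
        # A small decoder standing in for the deferred neural renderer.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, uv):
        # uv: (B, H, W, 2) per-pixel texture coordinates in [-1, 1], produced
        # by rasterizing the 3D mesh proxy with the standard graphics pipeline.
        feats = F.grid_sample(self.texture.expand(uv.shape[0], -1, -1, -1),
                              uv, align_corners=True)
        return self.decoder(feats)
```

Because both `self.texture` and `self.decoder` receive gradients, end-to-end training optimizes the texture as part of scene capture, which is what lets the pipeline compensate for imperfect geometry.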
Do We Really Need to Collect Millions of Faces for Effective Face Recognition?
Face recognition capabilities have recently made extraordinary leaps. Though
this progress is at least partially due to ballooning training set sizes --
huge numbers of face images downloaded and labeled for identity -- it is not
clear if the formidable task of collecting so many images is truly necessary.
We propose a far more accessible means of increasing training data sizes for
face recognition systems. Rather than manually harvesting and labeling more
faces, we simply synthesize them. We describe novel methods of enriching an
existing dataset with important facial appearance variations by manipulating
the faces it contains. We further apply this synthesis approach when matching
query images represented using a standard convolutional neural network. The
effect of training and testing with synthesized images is extensively tested on
the LFW and IJB-A (verification and identification) benchmarks and Janus CS2.
The performance obtained by our approach matches state-of-the-art results
reported by systems trained on millions of downloaded images.
MVF-Net: Multi-View 3D Face Morphable Model Regression
We address the problem of recovering the 3D geometry of a human face from a
set of facial images in multiple views. While recent studies have shown
impressive progress in 3D Morphable Model (3DMM) based facial reconstruction,
the settings are mostly restricted to a single view. There is an inherent
drawback in the single-view setting: the lack of reliable 3D constraints can
cause unresolvable ambiguities. We in this paper explore 3DMM-based shape
recovery in a different setting, where a set of multi-view facial images are
given as input. A novel approach is proposed to regress 3DMM parameters from
multi-view inputs with an end-to-end trainable Convolutional Neural Network
(CNN). Multi-view geometric constraints are incorporated into the network by
establishing dense correspondences between different views leveraging a novel
self-supervised view alignment loss. The main ingredient of the view alignment
loss is a differentiable dense optical flow estimator that can backpropagate
the alignment errors between an input view and a synthetic rendering from
another input view, which is projected to the target view through the 3D shape
to be inferred. By minimizing the view alignment loss, better 3D shapes can be
recovered, such that the synthetic projections from one view to another align
better with the observed images. Extensive experiments demonstrate the
superiority of the proposed method over other 3DMM methods. Comment: 2019 Conference on Computer Vision and Pattern Recognition
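One plausible reading of this term, as a hedged sketch: if the inferred 3DMM shape and poses are correct, the cross-view rendering aligns with the observed view and the residual optical flow between them vanishes. The flow network interface and the penalty form below are assumptions, not the paper's exact formulation.

```python
import torch

def view_alignment_loss(target_view, cross_rendering, flow_net):
    # cross_rendering: view A re-rendered into view B's camera through the
    # inferred 3D shape; target_view: the observed image in view B.
    # flow_net: any differentiable optical-flow module returning per-pixel
    # (dx, dy) displacements of shape (B, 2, H, W).
    flow = flow_net(cross_rendering, target_view)
    # Correct geometry implies zero residual flow, so penalize its magnitude;
    # gradients flow back through the rendering into the 3DMM parameters.
    return flow.abs().mean()
```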
Incremental Scene Synthesis
We present a method to incrementally generate complete 2D or 3D scenes with
the following properties: (a) it is globally consistent at each step according
to a learned scene prior, (b) real observations of a scene can be incorporated
while maintaining global consistency, (c) unobserved regions can be hallucinated
locally in a manner consistent with previous observations, hallucinations, and global
priors, and (d) hallucinations are statistical in nature, i.e., different
scenes can be generated from the same observations. To achieve this, we model
the virtual scene as traversed by an active agent that, at each step, either
perceives an observed part of the scene or generates a local hallucination. The
latter can be interpreted as the agent's expectation at this step of its path
through the scene, and can be applied to autonomous navigation. In the limit of
observing real data at each point, our method converges to solving the SLAM
problem. It can otherwise sample entirely imagined scenes from prior
distributions. Besides autonomous agents, applications include problems where
large amounts of data are required for building robust real-world systems, but
few samples are available. We demonstrate efficacy on various 2D as well as 3D
data.
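The perceive-or-hallucinate loop can be sketched at the interface level; `agent`, `scene_memory`, and `prior_model` below are hypothetical objects standing in for the paper's components, not the authors' API.

```python
def incremental_synthesis(agent, scene_memory, prior_model, steps):
    for _ in range(steps):
        pose = agent.next_pose()
        obs = agent.observe(pose)             # None when nothing is visible
        if obs is not None:
            # Fold real data into the map while keeping it globally consistent.
            scene_memory.register(pose, obs)
        else:
            # Sample a local hallucination conditioned on memory and the
            # learned prior; sampling is why different scenes can arise
            # from the same observations.
            patch = prior_model.sample(scene_memory.context(pose))
            scene_memory.register(pose, patch)
    return scene_memory.render_global()
```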
Deep View Morphing
Recently, convolutional neural networks (CNN) have been successfully applied
to view synthesis problems. However, such CNN-based methods can suffer from
lack of texture details, shape distortions, or high computational complexity.
In this paper, we propose a novel CNN architecture for view synthesis called
"Deep View Morphing" that does not suffer from these issues. To synthesize a
middle view of two input images, a rectification network first rectifies the
two input images. An encoder-decoder network then generates dense
correspondences between the rectified images and blending masks to predict the
visibility of pixels of the rectified images in the middle view. A view
morphing network finally synthesizes the middle view using the dense
correspondences and blending masks. We experimentally show that the proposed
method significantly outperforms the state-of-the-art CNN-based view synthesis method. Comment: Accepted to CVPR 2017
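The final morphing step amounts to warping both rectified inputs along the predicted dense correspondences and blending them with the visibility masks. A rough sketch, with the half-way warp and tensor layouts as assumptions:

```python
import torch
import torch.nn.functional as F

def morph_middle_view(rect_a, rect_b, corr, mask):
    # rect_a, rect_b: (B, 3, H, W) rectified input images.
    # corr: (B, H, W, 2) dense correspondence field from A to B, expressed in
    # normalized grid units; mask: (B, 1, H, W) visibility blend weights.
    B, _, H, W = rect_a.shape
    ys = torch.linspace(-1, 1, H, device=rect_a.device)
    xs = torch.linspace(-1, 1, W, device=rect_a.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    identity = torch.stack([gx, gy], dim=-1).expand(B, -1, -1, -1)
    # Sample each input half-way toward the other along the correspondence.
    warped_a = F.grid_sample(rect_a, identity + 0.5 * corr, align_corners=True)
    warped_b = F.grid_sample(rect_b, identity - 0.5 * corr, align_corners=True)
    # Visibility-aware per-pixel blend of the two half-warped images.
    return mask * warped_a + (1 - mask) * warped_b
```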
Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation
Recent studies have shown remarkable advances in 3D human pose estimation
from monocular images, with the help of large-scale indoor 3D datasets and
sophisticated network architectures. However, the generalizability to different
environments remains an elusive goal. In this work, we propose a geometry-aware
3D representation for the human pose to address this limitation by using
multiple views in a simple auto-encoder model at the training stage and only 2D
keypoint information as supervision. A view synthesis framework is proposed to
learn the shared 3D representation between viewpoints by synthesizing the
human pose from one viewpoint to another. Instead of performing a direct
transfer at the raw image level, we propose a skeleton-based encoder-decoder
mechanism to distil only the pose-related representation in the latent space. A
learning-based representation consistency constraint is further introduced to
improve the robustness of the latent 3D representation. Since the learnt
representation encodes 3D geometry information, mapping it to 3D pose will be
much easier than conventional frameworks that use an image or 2D coordinates as
the input of 3D pose estimator. We demonstrate our approach on the task of 3D
human pose estimation. Comprehensive experiments on three popular benchmarks
show that our model can significantly improve the performance of
state-of-the-art methods by simply injecting the representation as a robust
3D prior. Comment: Accepted as a CVPR 2019 oral paper. Project page:
https://kwanyeelin.github.io
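A hedged sketch of the skeleton-level view synthesis: 2D keypoints from one view are encoded, the latent code is transformed by the known relative camera rotation, and keypoints in the other view are decoded. Treating the latent as a bag of rotatable 3D points is an illustrative choice here, as are all dimensions; it is not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn

class SkeletonViewSynthesis(nn.Module):
    def __init__(self, n_joints=17, latent_triplets=128):
        super().__init__()
        self.latent_triplets = latent_triplets
        self.encoder = nn.Sequential(
            nn.Linear(n_joints * 2, 1024), nn.ReLU(),
            nn.Linear(1024, latent_triplets * 3))
        self.decoder = nn.Sequential(
            nn.Linear(latent_triplets * 3, 1024), nn.ReLU(),
            nn.Linear(1024, n_joints * 2))

    def forward(self, kpts_src, rot_src_to_tgt):
        # kpts_src: (B, n_joints*2) 2D keypoints; rot_src_to_tgt: (B, 3, 3).
        z = self.encoder(kpts_src).view(-1, self.latent_triplets, 3)
        # Treat the latent as a set of 3D points so a camera rotation acts on
        # it directly; this is what makes the representation geometry-aware.
        z = torch.einsum("bij,bkj->bki", rot_src_to_tgt, z)
        return self.decoder(z.reshape(z.shape[0], -1))
```

Supervising the decoded keypoints in the target view with only 2D annotations is what keeps the scheme weakly supervised; the rotated latent can then be mapped to 3D pose with a small downstream regressor.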
LookinGood: Enhancing Performance Capture with Real-time Neural Re-Rendering
Motivated by augmented and virtual reality applications such as telepresence,
there has been a recent focus in real-time performance capture of humans under
motion. However, given the real-time constraint, these systems often suffer
from artifacts in geometry and texture such as holes and noise in the final
rendering, poor lighting, and low-resolution textures. We take a novel
approach: augmenting such real-time performance capture systems with a deep
architecture that takes a rendering from an arbitrary viewpoint, and jointly
performs completion, super resolution, and denoising of the imagery in
real-time. We call this approach neural (re-)rendering, and our live system
"LookinGood". Our deep architecture is trained to produce high resolution and
high quality images from a coarse rendering in real-time. First, we propose a
self-supervised training method that does not require manual ground-truth
annotation. We contribute a specialized reconstruction error that uses semantic
information to focus on relevant parts of the subject, e.g. the face. We also
introduce a saliency-based reweighting scheme for the loss function that is able to
discard outliers. We specifically design the system for virtual and augmented
reality headsets where the consistency between the left and right eye plays a
crucial role in the final user experience. Finally, we generate temporally
stable results by explicitly minimizing the difference between two consecutive
frames. We tested the proposed system in two different scenarios: the first
involving a single RGB-D sensor and upper-body reconstruction of an actor, the
second consisting of full-body 360-degree capture. Through extensive
experimentation, we demonstrate how our system generalizes across unseen
sequences and subjects. The supplementary video is available at
http://youtu.be/Md3tdAKoLGU. Comment: To be presented at SIGGRAPH Asia 2018
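In the spirit of the description above, a minimal sketch of a saliency-reweighted reconstruction term plus a temporal-stability term; the relative weighting and the source of the saliency map (e.g. a semantic segmentation highlighting the face) are assumptions.

```python
import torch
import torch.nn.functional as F

def rerender_loss(pred, target, saliency, prev_pred, prev_target):
    # pred/target: current output and ground truth, (B, 3, H, W).
    # saliency: (B, 1, H, W) weights emphasizing semantically important
    # regions; prev_pred/prev_target: the previous frame's pair.
    # Saliency-weighted reconstruction: important regions count more,
    # and low-weight outlier regions are effectively discarded.
    recon = (saliency * (pred - target).abs()).mean()
    # Temporal stability: the predicted frame-to-frame difference should
    # match the ground-truth difference, suppressing flicker.
    temporal = F.l1_loss(pred - prev_pred, target - prev_target)
    return recon + 0.5 * temporal
```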
Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation
Cross-view image translation is challenging because it involves images with
drastically different views and severe deformation. In this paper, we propose a
novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that
makes it possible to generate images of natural scenes in arbitrary viewpoints,
based on an image of the scene and a novel semantic map. The proposed
SelectionGAN explicitly utilizes the semantic information and consists of two
stages. In the first stage, the condition image and the target semantic map are
fed into a cycled semantic-guided generation network to produce initial coarse
results. In the second stage, we refine the initial results by using a
multi-channel attention selection mechanism. Moreover, uncertainty maps
automatically learned from attentions are used to guide the pixel loss for
better network optimization. Extensive experiments on Dayton, CVUSA and Ego2Top
datasets show that our model is able to generate significantly better results
than the state-of-the-art methods. The source code, data and trained models are
available at https://github.com/Ha0Tang/SelectionGAN. Comment: 20 pages, 16 figures, accepted to CVPR 2019 as an oral paper
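The second-stage selection mechanism can be sketched as a per-pixel convex combination of several generated image candidates under softmax attention; the channel counts below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentionSelection(nn.Module):
    def __init__(self, feat_dim=64, n_candidates=10):
        super().__init__()
        # Predict several intermediate RGB candidates from shared features.
        self.to_candidates = nn.Conv2d(feat_dim, 3 * n_candidates, 3, padding=1)
        # Predict one attention map per candidate.
        self.to_attention = nn.Conv2d(feat_dim, n_candidates, 1)
        self.n = n_candidates

    def forward(self, feats):
        B, _, H, W = feats.shape
        cands = torch.tanh(self.to_candidates(feats)).view(B, self.n, 3, H, W)
        attn = torch.softmax(self.to_attention(feats), dim=1)  # (B, n, H, W)
        # Per-pixel convex combination of the candidate images; the softmax
        # lets the network choose the best candidate per spatial location.
        return (attn.unsqueeze(2) * cands).sum(dim=1)
```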