DeepVoxels: Learning Persistent 3D Feature Embeddings
In this work, we address the lack of 3D understanding of generative neural
networks by introducing a persistent 3D feature embedding for view synthesis.
To this end, we propose DeepVoxels, a learned representation that encodes the
view-dependent appearance of a 3D scene without having to explicitly model its
geometry. At its core, our approach is based on a Cartesian 3D grid of
persistent embedded features that learn to make use of the underlying 3D scene
structure. Our approach combines insights from 3D geometric computer vision
with recent advances in learning image-to-image mappings based on adversarial
loss functions. DeepVoxels is supervised with a 2D re-rendering loss, without
requiring a 3D reconstruction of the scene, and enforces perspective and
multi-view geometry in a principled manner. We apply our persistent 3D scene
representation to the problem of novel view synthesis, demonstrating
high-quality results for a variety of challenging scenes.
Comment: Video: https://www.youtube.com/watch?v=HM_WsZhoGXw Supplemental
material:
https://drive.google.com/file/d/1BnZRyNcVUty6-LxAstN83H79ktUq8Cjp/view?usp=sharing
Code: https://github.com/vsitzmann/deepvoxels Project page:
https://vsitzmann.github.io/deepvoxels
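To make the core idea concrete, here is a minimal sketch of a persistent 3D voxel feature grid trained with a 2D re-rendering loss, assuming PyTorch; all names and sizes are illustrative, and the paper's full pipeline (e.g., its occlusion reasoning and adversarial loss) is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelFeatureScene(nn.Module):
    def __init__(self, feat_dim=16, grid_size=32):
        super().__init__()
        # Persistent 3D feature embedding: one learnable feature per voxel.
        self.grid = nn.Parameter(
            torch.zeros(1, feat_dim, grid_size, grid_size, grid_size))
        # 2D rendering network: sampled features -> RGB image.
        self.render = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, points):
        # points: (1, H, W, 3) sample locations in [-1, 1], one per pixel,
        # obtained by unprojecting the target view with its camera parameters,
        # which is how perspective geometry enters the model.
        feats = F.grid_sample(self.grid, points.unsqueeze(1),
                              align_corners=True)   # (1, C, 1, H, W)
        return self.render(feats.squeeze(2))        # (1, 3, H, W)

scene = VoxelFeatureScene()
points = torch.rand(1, 64, 64, 3) * 2 - 1  # stand-in for unprojected pixels
target = torch.rand(1, 3, 64, 64)          # observed training view
loss = F.l1_loss(scene(points), target)    # 2D re-rendering loss
loss.backward()
```

Because supervision is purely 2D, the grid features learn to respect the scene's 3D structure without any explicit geometry reconstruction.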
In-N-Out: Face Video Inversion and Editing with Volumetric Decomposition
3D-aware GANs offer new capabilities for creative content editing, such as
view synthesis, while preserving the editing capability of their 2D
counterparts. These methods use GAN inversion to reconstruct images or videos
by optimizing a latent code, allowing for semantic editing by manipulating the
code. However, a model pre-trained on a face dataset (e.g., FFHQ) often has
difficulty handling faces with out-of-distribution (OOD) objects, e.g., heavy
make-up or occlusions. We address this issue by explicitly modeling OOD objects
in face videos. Our core idea is to represent the face in a video using two
neural radiance fields, one for the in-distribution and the other for the
out-of-distribution object, and compose them together for reconstruction. Such
explicit decomposition alleviates the inherent trade-off between reconstruction
fidelity and editability. We evaluate our method's reconstruction accuracy and
editability on challenging real videos and showcase favorable results against
other baselines.
Comment: Project page: https://in-n-out-3d.github.io
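As a rough sketch of the composition step, one plausible implementation (assuming PyTorch; the paper's exact blending may differ, and all names here are illustrative) adds the densities of the two radiance fields and blends their colors by density weight before standard volume rendering:

```python
import torch

def compose_fields(sigma_in, rgb_in, sigma_ood, rgb_ood, eps=1e-8):
    # sigma_*: (N,) densities, rgb_*: (N, 3) colors from the in-distribution
    # and OOD fields, queried at the same sample points along a ray.
    sigma = sigma_in + sigma_ood                    # densities add
    w = (sigma_in / (sigma + eps)).unsqueeze(-1)
    rgb = w * rgb_in + (1.0 - w) * rgb_ood          # density-weighted colors
    return sigma, rgb

def volume_render(sigma, rgb, deltas):
    # Standard NeRF quadrature along one ray.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], 0)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(0)     # composited pixel color

n = 64  # samples per ray
sigma, rgb = compose_fields(torch.rand(n), torch.rand(n, 3),
                            torch.rand(n), torch.rand(n, 3))
pixel = volume_render(sigma, rgb, deltas=torch.full((n,), 0.01))
```

Keeping the OOD object in its own field means edits applied to the in-distribution latent code leave the occluder's geometry and appearance untouched.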
Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement
Human perception reliably identifies movable and immovable parts of 3D
scenes, and completes the 3D structure of objects and background from
incomplete observations. We learn this skill not via labeled examples, but
simply by observing objects move. In this work, we propose an approach that
observes unlabeled multi-view videos at training time and learns to map a
single image observation of a complex scene, such as a street with cars, to a
3D neural scene representation that is disentangled into movable and immovable
parts while plausibly completing its 3D structure. We separately parameterize
movable and immovable scene parts via 2D neural ground plans. These ground
plans are 2D grids of features aligned with the ground plane that can be
locally decoded into 3D neural radiance fields. Our model is trained in a
self-supervised manner via neural rendering. We demonstrate that the structure
inherent to our disentangled 3D representation enables a variety of downstream
tasks in street-scale 3D scenes using simple heuristics, such as extraction of
object-centric 3D representations, novel view synthesis, instance segmentation,
and 3D bounding box prediction, highlighting its value as a backbone for
data-efficient 3D scene understanding models. This disentanglement further
enables scene editing via object manipulation such as deletion, insertion, and
rigid-body motion.
Comment: Project page: https://prafullsharma.net/see3d
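A minimal sketch of a neural ground plan, assuming PyTorch and toy dimensions (all names are hypothetical): a point's feature is bilinearly sampled at its (x, z) ground projection and decoded, together with its height y, into radiance-field density and color, with separate plans for movable and immovable parts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundPlanField(nn.Module):
    def __init__(self, feat_dim=32, res=128):
        super().__init__()
        # 2D grid of features aligned with the ground plane (the "ground plan").
        self.plan = nn.Parameter(torch.randn(1, feat_dim, res, res) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 64), nn.ReLU(),
            nn.Linear(64, 4),                      # density + RGB
        )

    def forward(self, xyz):
        # xyz: (N, 3) points with x, z in [-1, 1]; y is height above the plane.
        xz = xyz[:, [0, 2]].view(1, 1, -1, 2)      # project onto ground plane
        f = F.grid_sample(self.plan, xz, align_corners=True)  # (1, C, 1, N)
        f = f[0, :, 0].t()                                    # (N, C)
        out = self.mlp(torch.cat([f, xyz[:, 1:2]], dim=-1))
        return F.relu(out[:, :1]), torch.sigmoid(out[:, 1:])  # sigma, rgb

# One plan per component: immovable background vs. movable objects.
static_field, dynamic_field = GroundPlanField(), GroundPlanField()
sigma_s, rgb_s = static_field(torch.rand(1024, 3) * 2 - 1)
```

Factoring the scene through a 2D plan makes object edits cheap: translating or deleting a region of the dynamic plan moves or removes the corresponding 3D object.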
Neural Radiance Fields: Past, Present, and Future
Modeling and interpreting 3D environments and surroundings have long enticed
researchers in 3D Computer Vision, Computer Graphics, and Machine Learning.
The paper on NeRFs (Neural Radiance Fields) by Mildenhall et al. sparked a
boom in Computer Graphics, Robotics, and Computer Vision, and the prospect of
high-resolution, low-storage 3D models for Augmented and Virtual Reality has
gained traction among researchers, with more than 1000 NeRF-related preprints
published. This paper serves as a bridge for people starting to study these
fields, building from the basics of Mathematics, Geometry, Computer Vision,
and Computer Graphics up to the difficulties encountered in Implicit
Representations at the intersection of all these disciplines. This survey
provides the history of rendering, Implicit Learning, and NeRFs, the
progression of research on NeRFs, and the potential applications and
implications of NeRFs in today's world. In doing so, this survey categorizes
all the NeRF-related research in terms of the datasets used, objective
functions, applications solved, and evaluation criteria for these applications.
Comment: 413 pages, 9 figures, 277 citations
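For orientation, the core NeRF image-formation model the survey builds on (restated here from Mildenhall et al. for reference) renders a pixel by integrating view-dependent color along a camera ray, weighted by accumulated transmittance:

```latex
% Camera ray r(t) = o + t d with near/far bounds t_n, t_f;
% sigma is the learned density field, c the view-dependent color field.
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,
                \mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```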
MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction
In this work we propose a novel model-based deep convolutional autoencoder
that addresses the highly challenging problem of reconstructing a 3D human face
from a single in-the-wild color image. To this end, we combine a convolutional
encoder network with an expert-designed generative model that serves as
decoder. The core innovation is our new differentiable parametric decoder that
encapsulates image formation analytically based on a generative model. Our
decoder takes as input a code vector with exactly defined semantic meaning that
encodes detailed face pose, shape, expression, skin reflectance and scene
illumination. Due to this new way of combining CNN-based with model-based face
reconstruction, the CNN-based encoder learns to extract semantically meaningful
parameters from a single monocular input image. For the first time, a CNN
encoder and an expert-designed generative model can be trained end-to-end in an
unsupervised manner, which renders training on very large (unlabeled) real
world data feasible. The obtained reconstructions compare favorably to current
state-of-the-art approaches in terms of quality and richness of representation.
Comment: International Conference on Computer Vision (ICCV) 2017 (Oral), 13 pages
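To illustrate the semantically structured code vector and the training setup, here is a conceptual sketch assuming PyTorch (dimensions and names are made up; the actual expert-designed decoder analytically implements morphable-model image formation, reduced to a stub here):

```python
import torch
import torch.nn as nn

# Semantic layout of the code vector (dimensions are illustrative).
CODE_SPLIT = {"pose": 6, "shape": 80, "expression": 64,
              "reflectance": 80, "illumination": 27}

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        code_dim = sum(CODE_SPLIT.values())
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )

    def forward(self, img):
        code = self.net(img)
        # Split into named, semantically meaningful parameter groups.
        out, i = {}, 0
        for name, d in CODE_SPLIT.items():
            out[name] = code[:, i:i + d]
            i += d
        return out

def model_based_decoder(params):
    # Stub for the expert-designed generative model: in the paper this
    # analytically forms an image from a 3D face model under the estimated
    # pose, skin reflectance, and scene illumination, and is differentiable.
    raise NotImplementedError

img = torch.rand(2, 3, 128, 128)
params = Encoder()(img)          # e.g. params["expression"]: (2, 64)
```

Training would minimize a photometric loss between the decoder's rendering and the input image; because the decoder is differentiable, gradients flow back into the encoder, which is why no labeled 3D ground truth is required.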