PersonNeRF: Personalized Reconstruction from Photo Collections
We present PersonNeRF, a method that takes a collection of photos of a
subject (e.g. Roger Federer) captured across multiple years with arbitrary body
poses and appearances, and enables rendering the subject with arbitrary novel
combinations of viewpoint, body pose, and appearance. PersonNeRF builds a
customized neural volumetric 3D model of the subject that is able to render an
entire space spanned by camera viewpoint, body pose, and appearance. A central
challenge in this task is dealing with sparse observations; a given body pose
is likely only observed by a single viewpoint with a single appearance, and a
given appearance is only observed under a handful of different body poses. We
address this issue by recovering a canonical T-pose neural volumetric
representation of the subject that allows for changing appearance across
different observations, but uses a shared pose-dependent motion field across
all observations. We demonstrate that this approach, along with regularization
of the recovered volumetric geometry to encourage smoothness, is able to
recover a model that renders compelling images from novel combinations of
viewpoint, pose, and appearance from these challenging unstructured photo
collections, outperforming prior work for free-viewpoint human rendering.
Comment: Project Page: https://grail.cs.washington.edu/projects/personnerf
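The factorization described above can be illustrated with a short sketch. The PyTorch snippet below is an illustrative assumption, not the authors' code: module names, network sizes, and the SMPL-style 72-dimensional pose vector are placeholders. It shows a single pose-conditioned motion field, shared across all observations, warping observation-space samples into a canonical T-pose volume, where density is appearance-independent and only the emitted color is modulated by a per-photo appearance embedding. Because every photo shares the same motion field and canonical density, a body pose seen from only one viewpoint and appearance still constrains geometry for all other appearances.

import torch
import torch.nn as nn

class MotionField(nn.Module):
    """Shared across all observations: maps (point, body pose) -> canonical point."""
    def __init__(self, pose_dim=72, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, pose):
        pose = pose.expand(x.shape[0], -1)
        # predict a pose-dependent offset that warps x into the canonical T-pose space
        return x + self.net(torch.cat([x, pose], dim=-1))

class CanonicalNeRF(nn.Module):
    """Canonical T-pose volume: density is appearance-independent, color is not."""
    def __init__(self, n_appearances=8, app_dim=16, hidden=128):
        super().__init__()
        self.appearance = nn.Embedding(n_appearances, app_dim)
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)
        self.rgb = nn.Sequential(nn.Linear(hidden + app_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, x_canonical, appearance_id):
        h = self.trunk(x_canonical)
        a = self.appearance(appearance_id).expand(h.shape[0], -1)
        return self.sigma(h), self.rgb(torch.cat([h, a], dim=-1))

# Usage: warp samples along a ray with the shared motion field, then query the
# canonical volume with the photo's appearance code.
motion, canonical = MotionField(), CanonicalNeRF()
pts = torch.rand(1024, 3)          # samples in observation space
pose = torch.rand(1, 72)           # SMPL-style body pose (assumption)
sigma, rgb = canonical(motion(pts, pose), torch.tensor([3]))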
Instant Multi-View Head Capture through Learnable Registration
Existing methods for capturing datasets of 3D heads in dense semantic
correspondence are slow, and commonly address the problem in two separate
steps: multi-view stereo (MVS) reconstruction followed by non-rigid
registration. To simplify this process, we introduce TEMPEH (Towards Estimation
of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads
in dense correspondence from calibrated multi-view images. Registering datasets
of 3D scans typically requires manual parameter tuning to find the right
balance between accurately fitting the scans' surfaces and being robust to
scanning noise and outliers. Instead, we propose to jointly register a 3D head
dataset while training TEMPEH. Specifically, during training we minimize a
geometric loss commonly used for surface registration, effectively leveraging
TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric
feature representation that samples and fuses features from each view using
camera calibration information. To account for partial occlusions and a large
capture volume that enables head movements, we use view- and surface-aware
feature fusion, and a spatial transformer-based head localization module,
respectively. We use raw MVS scans as supervision during training, but, once
trained, TEMPEH directly predicts 3D heads in dense correspondence without
requiring scans. Predicting one head takes about 0.3 seconds with a median
reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art.
This enables the efficient capture of large datasets containing multiple people
and diverse facial motions. Code, model, and data are publicly available at
https://tempeh.is.tue.mpg.de.
Comment: Conference on Computer Vision and Pattern Recognition (CVPR) 202
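A hedged sketch of the volumetric fusion step described above: project the points of a 3D feature grid into each calibrated view, bilinearly sample per-view image features, and fuse them into one feature per point. The function below uses a plain mean over views for simplicity (TEMPEH instead uses view- and surface-aware fusion), and all shapes and dummy camera parameters are illustrative assumptions.

import torch
import torch.nn.functional as F

def fuse_multiview_features(feats, K, T, points, image_size):
    """
    feats:      (V, C, H, W) per-view feature maps
    K:          (V, 3, 3) intrinsics
    T:          (V, 3, 4) world-to-camera extrinsics
    points:     (N, 3) world-space grid points
    image_size: (W, H) in pixels
    Returns (N, C) fused features (simple mean over views).
    """
    W, H = image_size
    ones = torch.ones(points.shape[0], 1)
    pts_h = torch.cat([points, ones], dim=-1)                  # (N, 4) homogeneous
    cam = torch.einsum('vij,nj->vni', T, pts_h)                # (V, N, 3) camera coords
    pix = torch.einsum('vij,vnj->vni', K, cam)
    pix = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)          # (V, N, 2) pixel coords
    # normalize to [-1, 1] for grid_sample
    grid = torch.stack([pix[..., 0] / (W - 1),
                        pix[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feats, grid.unsqueeze(1), align_corners=True)  # (V, C, 1, N)
    return sampled.squeeze(2).mean(dim=0).t()                  # (N, C)

# Usage with dummy data: 4 views, 32-channel features, 1000 grid points.
feats = torch.rand(4, 32, 64, 64)
K = torch.eye(3).repeat(4, 1, 1) * 50.0
K[:, 2, 2] = 1.0
K[:, 0, 2] = 32.0
K[:, 1, 2] = 32.0
T = torch.cat([torch.eye(3), torch.tensor([[0.], [0.], [2.]])], dim=1).repeat(4, 1, 1)
pts = torch.rand(1000, 3) - 0.5
fused = fuse_multiview_features(feats, K, T, pts, image_size=(64, 64))  # (1000, 32)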
Virtual Occlusions Through Implicit Depth
For augmented reality (AR), it is important that virtual assets appear to 'sit among' real-world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth regression as an intermediate step. We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.
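A rough sketch of this formulation, with assumed input shapes and a tiny placeholder network rather than the paper's architecture: the model consumes a colour image together with the rendered depth of the virtual geometry and directly outputs a soft visibility mask for the virtual content, which is used for compositing instead of thresholding a regressed real-scene depth map.

import torch
import torch.nn as nn

class OcclusionNet(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        # input: 3 RGB channels + 1 channel of known virtual-geometry depth
        self.net = nn.Sequential(
            nn.Conv2d(4, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, rgb, virtual_depth):
        x = torch.cat([rgb, virtual_depth], dim=1)
        # per-pixel probability that the virtual asset is in front of real matter
        return torch.sigmoid(self.net(x))

# Usage: composite with a soft mask instead of a hard depth test.
rgb = torch.rand(1, 3, 192, 256)
virtual_depth = torch.rand(1, 1, 192, 256) * 5.0
virtual_rgb = torch.rand(1, 3, 192, 256)
mask = OcclusionNet()(rgb, virtual_depth)            # (1, 1, 192, 256), values in [0, 1]
composite = mask * virtual_rgb + (1 - mask) * rgb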
The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth
Self-supervised monocular depth estimation networks are trained to predict
scene depth using nearby frames as a supervision signal during training.
However, for many applications, sequence information in the form of video
frames is also available at test time. The vast majority of monocular networks
do not make use of this extra signal, thus ignoring valuable information that
could be used to improve the predicted depth. Those that do either use
computationally expensive test-time refinement techniques or off-the-shelf
recurrent networks, which only indirectly make use of the geometric information
that is inherently available.
We propose ManyDepth, an adaptive approach to dense depth estimation that can
make use of sequence information at test time, when it is available. Taking
inspiration from multi-view stereo, we propose a deep end-to-end cost volume
based approach that is trained using self-supervision only. We present a novel
consistency loss that encourages the network to ignore the cost volume when it
is deemed unreliable, e.g. in the case of moving objects, and an augmentation
scheme to cope with static cameras. Our detailed experiments on both KITTI and
Cityscapes show that we outperform all published self-supervised baselines,
including those that use single or multiple frames at test time.
Comment: CVPR 202
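A minimal sketch of the consistency idea, where the threshold, names, and exact form are illustrative assumptions rather than ManyDepth's published formulation: wherever the cost volume's argmin depth disagrees strongly with a single-frame prediction (e.g. on moving objects), the cost volume is treated as unreliable and the multi-frame depth is supervised toward the single-frame one there.

import torch

def consistency_loss(multi_depth, single_depth, cv_argmin_depth, rel_thresh=1.0):
    """
    multi_depth:     (B, 1, H, W) depth from the cost-volume (multi-frame) network
    single_depth:    (B, 1, H, W) depth from a single-frame network (no gradient)
    cv_argmin_depth: (B, 1, H, W) per-pixel argmin depth of the cost volume
    """
    single_depth = single_depth.detach()
    # pixels where the cost volume and the single-frame depth disagree a lot
    rel_err = (cv_argmin_depth - single_depth).abs() / single_depth.clamp(min=1e-6)
    unreliable = (rel_err > rel_thresh).float()
    # encourage the multi-frame network to ignore the cost volume at those pixels
    return (unreliable * (multi_depth - single_depth).abs()).mean()

# Usage with dummy tensors
B, H, W = 2, 96, 320
loss = consistency_loss(torch.rand(B, 1, H, W) * 10 + 0.1,
                        torch.rand(B, 1, H, W) * 10 + 0.1,
                        torch.rand(B, 1, H, W) * 10 + 0.1)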