Unsupervised Learning of 3D Structure from Images
A key goal of computer vision is to recover the underlying 3D structure from
2D observations of the world. In this paper we learn strong deep generative
models of 3D structures, and recover these structures from 3D and 2D images via
probabilistic inference. We demonstrate high-quality samples and report
log-likelihoods on several datasets, including ShapeNet [2], and establish the
first benchmarks in the literature. We also show how these models and their
inference networks can be trained end-to-end from 2D images. This demonstrates
for the first time the feasibility of learning to infer 3D representations of
the world in a purely unsupervised manner.
Comment: Appears in Advances in Neural Information Processing Systems 29 (NIPS 2016)
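As a rough illustration of the recipe this abstract describes (a generative model of 3D structure whose training signal comes only from 2D images), here is a minimal sketch: a VAE whose decoder emits a voxel occupancy grid and whose reconstruction loss is computed on a crude differentiable projection of that grid. The architecture, the orthographic sum-projection, and all layer sizes are illustrative assumptions, not the paper's model.

```python
# Minimal sketch (assumption, not the paper's model): a VAE whose decoder outputs a
# voxel occupancy grid; the loss compares a simple differentiable projection of that
# grid to the observed 2D silhouette, so only 2D images are needed for training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Image2VoxelVAE(nn.Module):
    def __init__(self, img_size=32, vox=32, z_dim=64):
        super().__init__()
        self.vox = vox
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(img_size * img_size, 256), nn.ReLU())
        self.to_mu, self.to_logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, vox ** 3))        # voxel occupancy logits

    def forward(self, img):
        h = self.enc(img)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        occupancy = torch.sigmoid(self.dec(z)).view(-1, self.vox, self.vox, self.vox)
        # Crude orthographic "projection": probability that at least one voxel along
        # the depth axis is occupied.
        proj = 1.0 - torch.prod(1.0 - occupancy, dim=1)
        return proj, mu, logvar

def loss_2d_only(model, img):
    """img: batch of binary silhouettes in [0, 1], shape (B, 32, 32)."""
    proj, mu, logvar = model(img)
    recon = F.binary_cross_entropy(proj, img.view_as(proj), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```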
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations
Unsupervised learning with generative models has the potential of discovering
rich representations of 3D scenes. While geometric deep learning has explored
3D-structure-aware representations of scene geometry, these models typically
require explicit 3D supervision. Emerging neural scene representations can be
trained only with posed 2D images, but existing methods ignore the
three-dimensional structure of scenes. We propose Scene Representation Networks
(SRNs), a continuous, 3D-structure-aware scene representation that encodes both
geometry and appearance. SRNs represent scenes as continuous functions that map
world coordinates to a feature representation of local scene properties. By
formulating the image formation as a differentiable ray-marching algorithm,
SRNs can be trained end-to-end from only 2D images and their camera poses,
without access to depth or shape. This formulation naturally generalizes across
scenes, learning powerful geometry and appearance priors in the process. We
demonstrate the potential of SRNs by evaluating them for novel view synthesis,
few-shot reconstruction, joint shape and appearance interpolation, and
unsupervised discovery of a non-rigid face model.
Comment: Video: https://youtu.be/6vMEBWD8O20 Project page: https://vsitzmann.github.io/srns
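A much-simplified sketch of the ingredients named in the abstract: a continuous function mapping world coordinates to features, and a differentiable ray marcher that steps each camera ray forward by a predicted distance before decoding a colour. The actual SRN formulation (e.g. its LSTM-based marcher) differs; everything below is an assumption for illustration.

```python
# Simplified sketch of the SRN idea (not the authors' code): a coordinate MLP maps 3D
# points to features; a marcher advances each camera ray by a predicted distance; a
# small head turns the feature at the final point into a colour.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneFunction(nn.Module):            # phi: R^3 -> feature vector
    def __init__(self, feat=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, feat), nn.ReLU())
    def forward(self, xyz):
        return self.net(xyz)

class RayMarcher(nn.Module):
    def __init__(self, feat=64, steps=10):
        super().__init__()
        self.step_net = nn.Linear(feat, 1)  # predicts how far to advance along the ray
        self.rgb_net = nn.Linear(feat, 3)
        self.steps = steps
    def forward(self, phi, origins, dirs):
        depth = torch.full(origins.shape[:-1] + (1,), 0.05)
        for _ in range(self.steps):
            pts = origins + depth * dirs                      # current sample points
            feat = phi(pts)
            depth = depth + torch.relu(self.step_net(feat))   # monotone marching
        return torch.sigmoid(self.rgb_net(feat)), depth

phi, marcher = SceneFunction(), RayMarcher()
origins = torch.zeros(1024, 3)                                # rays from one camera
dirs = F.normalize(torch.randn(1024, 3), dim=-1)
rgb, depth = marcher(phi, origins, dirs)
# Training would minimise ||rgb - observed pixel colours||^2 over posed images only.
```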
Deep NRSfM++: Towards Unsupervised 2D-3D Lifting in the Wild
The recovery of 3D shape and pose from 2D landmarks stemming from a large
ensemble of images can be viewed as a non-rigid structure from motion (NRSfM)
problem. Classical NRSfM approaches, however, are problematic as they rely on
heuristic priors on the 3D structure (e.g. low rank) that do not scale well to
large datasets. Learning-based methods are showing the potential to reconstruct
a much broader set of 3D structures than classical methods -- dramatically
expanding the importance of NRSfM to atemporal unsupervised 2D to 3D lifting.
Hitherto, these learning approaches have not been able to effectively model
perspective cameras or handle missing/occluded points -- limiting their
applicability to in-the-wild datasets. In this paper, we present a generalized
strategy for improving learning-based NRSfM methods to tackle the above issues.
Our approach, Deep NRSfM++, achieves state-of-the-art performance across
numerous large-scale benchmarks, outperforming both classical and
learning-based 2D-3D lifting methods.
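The abstract names two obstacles, perspective cameras and missing/occluded points, without detailing the fix. A hedged sketch of how such issues are commonly handled in a reprojection loss (not the Deep NRSfM++ method itself) looks like this:

```python
# Hedged sketch of the two ingredients the abstract flags, not Deep NRSfM++ itself:
# a perspective (rather than orthographic) reprojection and a visibility mask so that
# missing or occluded 2D landmarks do not contribute to the loss.
import torch

def masked_perspective_reprojection_loss(X3d, x2d, visible, f=1.0):
    """X3d: (B, N, 3) predicted shape in camera coordinates with Z > 0.
    x2d: (B, N, 2) observed landmarks; visible: (B, N) 0/1 mask; f: focal length."""
    proj = f * X3d[..., :2] / X3d[..., 2:3].clamp(min=1e-6)   # perspective division
    err = ((proj - x2d) ** 2).sum(-1)                         # per-landmark error
    return (err * visible).sum() / visible.sum().clamp(min=1.0)

# Toy check with noise-free observations and ~20% of points marked occluded.
x3d = torch.rand(4, 17, 3) + torch.tensor([0.0, 0.0, 2.0])    # shapes in front of camera
x2d = x3d[..., :2] / x3d[..., 2:3]
mask = (torch.rand(4, 17) > 0.2).float()
print(masked_perspective_reprojection_loss(x3d, x2d, mask))   # ~0 on perfect data
```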
Unsupervised Learning of Visual 3D Keypoints for Control
Learning sensorimotor control policies from high-dimensional images crucially
relies on the quality of the underlying visual representations. Prior works
show that structured latent representations such as visual keypoints often
outperform unstructured representations for robotic control. However, most of
these representations, whether structured or unstructured, are learned in a 2D space
even though the control tasks are usually performed in a 3D environment. In
this work, we propose a framework to learn such a 3D geometric structure
directly from images in an end-to-end unsupervised manner. The input images are
embedded into latent 3D keypoints via a differentiable encoder which is trained
to optimize both a multi-view consistency loss and downstream task objective.
These discovered 3D keypoints tend to meaningfully capture robot joints as well
as object movements in a consistent manner across both time and 3D space. The
proposed approach outperforms prior state-of-the-art methods across a variety of
reinforcement learning benchmarks. Code and videos at
https://buoyancy99.github.io/unsup-3d-keypoints/
Comment: Accepted at ICML 2021. Videos and code at https://buoyancy99.github.io/unsup-3d-keypoints
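A hypothetical sketch of the multi-view consistency idea mentioned in the abstract: keypoints predicted per view are mapped into a shared world frame using known camera poses and penalised for disagreeing across views. The pooling-to-consensus formulation below is my assumption, not the authors' implementation.

```python
# Illustrative sketch (assumption, not the authors' code) of a multi-view consistency
# loss: per-view predicted 3D keypoints are mapped into a common world frame with
# known camera poses and penalised for disagreeing.
import torch

def cam_to_world(kp_cam, R, t):
    # kp_cam: (V, K, 3) keypoints in each camera frame; R: (V, 3, 3); t: (V, 3)
    return torch.einsum('vij,vkj->vki', R, kp_cam) + t[:, None, :]

def multiview_consistency_loss(kp_cam, R, t):
    kp_world = cam_to_world(kp_cam, R, t)          # (V, K, 3)
    mean_kp = kp_world.mean(dim=0, keepdim=True)   # consensus keypoints
    return ((kp_world - mean_kp) ** 2).mean()

# Toy check: identical world keypoints seen from two cameras give ~zero loss.
kp_world = torch.rand(1, 6, 3)
R = torch.eye(3).repeat(2, 1, 1)
t = torch.tensor([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
kp_cam = torch.einsum('vij,vkj->vki', R.transpose(1, 2), kp_world - t[:, None, :])
print(multiview_consistency_loss(kp_cam, R, t))
```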
DRWR: A Differentiable Renderer without Rendering for Unsupervised 3D Structure Learning from Silhouette Images
Differentiable renderers have been used successfully for unsupervised 3D
structure learning from 2D images because they can bridge the gap between 3D
and 2D. To optimize 3D shape parameters, current renderers rely on pixel-wise
losses between rendered images of 3D reconstructions and ground truth images
from corresponding viewpoints. Hence they require interpolation of the
recovered 3D structure at each pixel, visibility handling, and optionally
evaluating a shading model. In contrast, here we propose a Differentiable
Renderer Without Rendering (DRWR) that omits these steps. DRWR only relies on a
simple but effective loss that evaluates how well the projections of
reconstructed 3D point clouds cover the ground truth object silhouette.
Specifically, DRWR employs a smooth silhouette loss to pull the projection of
each individual 3D point inside the object silhouette, and a structure-aware
repulsion loss to push each pair of projections that fall inside the silhouette
far away from each other. Although we omit surface interpolation, visibility
handling, and shading, our results demonstrate that DRWR achieves
state-of-the-art accuracies under widely used benchmarks, outperforming
previous methods both qualitatively and quantitatively. In addition, our
training times are significantly lower due to the simplicity of DRWR.
Comment: Accepted at ICML 2020
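The abstract describes two loss terms: a smooth silhouette loss that pulls each projected 3D point inside the object silhouette, and a repulsion loss that spreads out projections already inside it. Below is a sketch with assumed functional forms (a distance-to-silhouette pull and an inverse-distance repulsion), not the published DRWR formulation.

```python
# Hedged sketch of the two loss terms the abstract describes; the exact functional
# forms here are assumptions, not the published DRWR losses.
import torch

def silhouette_pull_loss(proj_xy, dist_to_silhouette):
    """proj_xy: (N, 2) projected 3D points. dist_to_silhouette: differentiable callable
    giving each point's distance to the silhouette interior (0 if already inside)."""
    return dist_to_silhouette(proj_xy).mean()

def repulsion_loss(proj_xy, inside_mask, eps=1e-4):
    """Push apart pairs of projections that fall inside the silhouette."""
    pts = proj_xy[inside_mask]
    if len(pts) < 2:
        return proj_xy.sum() * 0.0                 # nothing to repel
    d2 = torch.cdist(pts, pts).pow(2) + eps        # pairwise squared distances
    off_diag = ~torch.eye(len(pts), dtype=torch.bool)
    return (1.0 / d2[off_diag]).mean()

# Toy usage with a circular silhouette of radius 1 centred at the origin.
proj = torch.randn(128, 2, requires_grad=True)
dist_fn = lambda p: torch.relu(p.norm(dim=-1) - 1.0)   # zero inside the disc
inside = proj.norm(dim=-1) < 1.0
loss = silhouette_pull_loss(proj, dist_fn) + 0.01 * repulsion_loss(proj, inside)
loss.backward()
```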
Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection
Non-parametric face modeling aims to reconstruct 3D faces from images alone,
without shape assumptions. While plausible facial details are predicted, the
models tend to over-depend on local color appearance and suffer from ambiguous
noise. To address this problem, this paper presents a novel Learning to
Aggregate and Personalize (LAP) framework for unsupervised, robust 3D face
modeling. Instead of relying on a controlled environment, the proposed method
implicitly disentangles an ID-consistent face and a scene-specific face from an
unconstrained photo set. Specifically, to learn the ID-consistent face, LAP
adaptively aggregates intrinsic face factors of an identity based on a novel
curriculum learning approach with relaxed consistency loss. To adapt the face
for a personalized scene, we propose a novel attribute-refining network to
modify the ID-consistent face with target attributes and details. Based on the
proposed method, we make unsupervised 3D face modeling benefit from meaningful
image facial structure and possibly higher resolutions. Extensive experiments
on benchmarks show LAP recovers superior or competitive face shape and texture,
compared with state-of-the-art (SOTA) methods with or without prior and
supervision.
Comment: CVPR 2021 Oral, 11 pages, 9 figures
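A loose sketch of the aggregation step the abstract mentions, under my own assumptions (confidence-weighted pooling of per-image face factors and a margin-based "relaxed" consistency loss); the real LAP framework is considerably richer.

```python
# Loose sketch (my assumptions, not the LAP architecture): per-image face factors of
# one identity are pooled into an ID-consistent code, and a relaxed consistency loss
# tolerates small per-image deviations via a margin.
import torch

def aggregate_id_code(per_image_codes, confidences):
    # per_image_codes: (N, D) one code per photo; confidences: (N,) in [0, 1]
    w = confidences / confidences.sum().clamp(min=1e-6)
    return (w[:, None] * per_image_codes).sum(dim=0)

def relaxed_consistency_loss(per_image_codes, id_code, margin=0.1):
    dev = (per_image_codes - id_code).norm(dim=-1)
    return torch.relu(dev - margin).mean()       # only penalise deviations > margin

codes = torch.randn(8, 64)                       # 8 photos of the same identity
conf = torch.rand(8)
id_code = aggregate_id_code(codes, conf)
print(relaxed_consistency_loss(codes, id_code))
```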
Joint Representation of Multiple Geometric Priors via a Shape Decomposition Model for Single Monocular 3D Pose Estimation
In this paper, we aim to recover the 3D human pose from 2D body joints of a
single image. The major challenge in this task is the depth ambiguity since
different 3D poses may produce similar 2D poses. Although many recent advances
in this problem are found in both unsupervised and supervised learning
approaches, the performance of most of them is greatly limited by the
insufficient diversity and richness of the training data. To alleviate this
issue, we propose an unsupervised learning approach, which is capable of
estimating various complex poses well under limited available training data.
Specifically, we propose a Shape Decomposition Model (SDM) in which a 3D pose
is considered as the superposition of two parts: a global structure and a set of
deformations. Based on the SDM, we estimate these two parts explicitly by
solving for two differently distributed sets of combination coefficients over
geometric priors. In addition, to obtain the geometric priors, a
joint dictionary learning algorithm is proposed to extract both coarse and fine
pose clues simultaneously from limited training data. Quantitative evaluations
on several widely used datasets demonstrate that our approach outperforms other
competitive approaches; in particular, it achieves significant improvements on
categories with more complex deformations. Furthermore, qualitative experiments
conducted on in-the-wild images also show the effectiveness of the proposed
approach.
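A worked toy example of the decomposition the abstract describes: a 3D pose expressed as the superposition of a coarse global-structure term and a fine deformation term, each a linear combination of dictionary atoms, with the coefficients fit to observed 2D joints. The random dictionaries, the orthographic projection, and the least-squares solver are all assumptions for illustration.

```python
# Illustrative sketch of the decomposition described in the abstract; the dictionaries
# and solver here are assumptions, not the paper's learned priors or algorithm.
import numpy as np

J, Kc, Kf = 17, 8, 20                       # joints, coarse atoms, fine atoms
B_coarse = np.random.randn(Kc, J, 3)        # stand-ins for learned dictionaries
B_fine = 0.1 * np.random.randn(Kf, J, 3)

def compose(c, d):
    """3D pose = global structure term + deformation term, each a linear combination."""
    return np.tensordot(c, B_coarse, axes=1) + np.tensordot(d, B_fine, axes=1)  # (J, 3)

def fit_coefficients(x2d):
    """Least-squares fit of (c, d) under a fixed orthographic projection (drop z)."""
    A = np.concatenate([B_coarse[:, :, :2].reshape(Kc, -1),
                        B_fine[:, :, :2].reshape(Kf, -1)], axis=0).T   # (2J, Kc+Kf)
    coef, *_ = np.linalg.lstsq(A, x2d.reshape(-1), rcond=None)
    return coef[:Kc], coef[Kc:]

c_true, d_true = np.random.randn(Kc), np.random.randn(Kf)
x2d = compose(c_true, d_true)[:, :2]        # synthetic noise-free 2D observation
c_hat, d_hat = fit_coefficients(x2d)
print(np.abs(compose(c_hat, d_hat)[:, :2] - x2d).max())   # ~0 reprojection error
```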
End-to-end 3D shape inverse rendering of different classes of objects from a single input image
In this paper a semi-supervised deep framework is proposed for the problem of
3D shape inverse rendering from a single 2D input image. The main structure of
the proposed framework consists of unsupervised pre-trained components, which
significantly reduce the need for labeled data when training the whole framework.
Using labeled data has the advantage of achieving accurate results without
the need for predefined assumptions about the image formation process. Three main
components are used in the proposed network: an encoder which maps 2D input
image to a representation space, a 3D decoder which decodes a representation to
a 3D structure and a mapping component in order to map 2D to 3D representation.
The only part that needs labels for training is the mapping component, which has
relatively few parameters. The other components in the network can be pre-trained
in an unsupervised way using only 2D images or 3D data, respectively. The way of
reconstructing 3D shapes in the decoder component, inspired by model-based
methods for 3D reconstruction, maps a low-dimensional representation to the 3D
shape space, with the advantage that the basis vectors of the shape space are
extracted from the training data itself rather than restricted to a small set of
predefined examples. The proposed framework therefore deals directly with the
coordinate values of the point cloud representation, which leads to dense 3D
shapes in the output. Experimental results on several benchmark datasets of
objects and human faces, and comparisons with recent similar methods, show the
power of the proposed network in recovering more detail from single 2D images.
Comment: 16 pages, 12 figures, 2 tables
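A minimal sketch of the three-component layout the abstract describes: an image encoder, a point-cloud decoder, and a small mapping network in between, where only the mapping would need paired supervision. Layer sizes and module names are assumptions, not the paper's architecture.

```python
# Minimal sketch of the layout described in the abstract; layer sizes and training
# details are assumptions. The encoder and decoder can each be pre-trained without
# image/shape pairs; only the small mapping network needs labels.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):                 # pre-trainable on 2D images alone
    def __init__(self, img_size=64, z2d=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(img_size * img_size, 512),
                                 nn.ReLU(), nn.Linear(512, z2d))
    def forward(self, x):
        return self.net(x)

class PointCloudDecoder(nn.Module):            # pre-trainable on 3D data alone
    def __init__(self, z3d=128, n_points=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z3d, 512), nn.ReLU(),
                                 nn.Linear(512, n_points * 3))
        self.n_points = n_points
    def forward(self, z):
        return self.net(z).view(-1, self.n_points, 3)   # dense xyz coordinates

mapping = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
enc, dec = ImageEncoder(), PointCloudDecoder()
points = dec(mapping(enc(torch.rand(2, 1, 64, 64))))     # (2, 1024, 3) point clouds
```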
DispSegNet: Leveraging Semantics for End-to-End Learning of Disparity Estimation from Stereo Imagery
Recent work has shown that convolutional neural networks (CNNs) can be
applied successfully in disparity estimation, but these methods still suffer
from errors in regions of low-texture, occlusions and reflections.
Concurrently, deep learning for semantic segmentation has shown great progress
in recent years. In this paper, we design a CNN architecture that combines
these two tasks to improve the quality and accuracy of disparity estimation
with the help of semantic segmentation. Specifically, we propose a network
structure in which these two tasks are highly coupled. One key novelty of this
approach is the two-stage refinement process. Initial disparity estimates are
refined with an embedding learned from the semantic segmentation branch of the
network. The proposed model is trained using an unsupervised approach, in which
images from one half of the stereo pair are warped and compared against images
from the other camera. Another key advantage of the proposed approach is that a
single network is capable of outputting disparity estimates and semantic
labels. These outputs are of great use in autonomous vehicle operation; with
real-time constraints being key, such performance improvements increase the
viability of driving applications. Experiments on KITTI and Cityscapes datasets
show that our model can achieve state-of-the-art results and that leveraging
embedding learned from semantic segmentation improves the performance of
disparity estimation.
Comment: Add more description on the architecture of the model. Add more discussion on section IV-C. Fix typo in formula
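A sketch of the unsupervised training signal the abstract describes: the right image is warped into the left view using the predicted disparity and compared photometrically. The warping below is a common formulation of this idea, not necessarily the authors' exact loss.

```python
# Sketch of an unsupervised photometric loss for stereo disparity (a common
# formulation, used here as an assumption rather than the authors' exact loss).
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp):
    """right: (B, C, H, W); disp: (B, 1, H, W) left-view disparity in pixels."""
    B, _, H, W = right.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    xs = xs[None] - disp[:, 0]                    # shift sampling column by disparity
    ys = ys[None].expand(B, -1, -1)
    grid = torch.stack([2 * xs / (W - 1) - 1,     # normalise to [-1, 1] for grid_sample
                        2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

def photometric_loss(left, right, disp):
    return (warp_right_to_left(right, disp) - left).abs().mean()

left, right = torch.rand(2, 3, 64, 128), torch.rand(2, 3, 64, 128)
disp = torch.zeros(2, 1, 64, 128, requires_grad=True)
photometric_loss(left, right, disp).backward()    # gradients flow to the disparity
```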
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Learning to estimate 3D geometry in a single frame and optical flow from
consecutive frames by watching unlabeled videos via deep convolutional networks
has made significant progress recently. Current state-of-the-art (SoTA) methods
treat the two tasks independently. One typical assumption of the existing depth
estimation methods is that the scenes contain no independently moving objects,
while object motion can be modeled easily using optical flow. In this paper,
we propose to address the two tasks as a whole, i.e. to jointly understand
per-pixel 3D geometry and motion. This eliminates the need of static scene
assumption and enforces the inherent geometrical consistency during the
learning process, yielding significantly improved results for both tasks. We
call our method "Every Pixel Counts++" or "EPC++". Specifically, during
training, given two consecutive frames from a video, we adopt three parallel
networks to predict the camera motion (MotionNet), dense depth map (DepthNet),
and per-pixel optical flow between two frames (OptFlowNet) respectively. The
three types of information are fed into a holistic 3D motion parser (HMP), and
per-pixel 3D motion of both the rigid background and the moving objects is
disentangled and recovered. Comprehensive experiments were conducted on
datasets with different scenes, including driving scenario (KITTI 2012 and
KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D) and synthetic
animation (MPI Sintel dataset). Performance on the five tasks of depth
estimation, optical flow estimation, odometry, moving object segmentation and
scene flow estimation shows that our approach outperforms other SoTA methods.
Code will be available at: https://github.com/chenxuluo/EPC.
Comment: Chenxu Luo, Zhenheng Yang, and Peng Wang contributed equally, TPAMI submission
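A simplified sketch of the decomposition idea behind a holistic motion parser: depth and camera motion induce a rigid flow for the static scene, and whatever the measured optical flow explains beyond that is attributed to independently moving objects. The intrinsics, the toy inputs, and the residual-flow definition below are assumptions, not the HMP module itself.

```python
# Simplified sketch (assumptions, not the HMP module): depth plus camera motion
# induce a rigid flow; the residual between measured optical flow and rigid flow
# is attributed to independently moving objects.
import torch

def rigid_flow(depth, K, R, t):
    """depth: (H, W); K: (3, 3) intrinsics; R, t: camera motion. Returns (H, W, 2)."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)       # homogeneous pixels
    rays = torch.einsum('ij,hwj->hwi', torch.inverse(K), pix)
    pts = rays * depth[..., None]                                  # back-projection
    pts2 = torch.einsum('ij,hwj->hwi', R, pts) + t                 # apply camera motion
    proj = torch.einsum('ij,hwj->hwi', K, pts2)
    uv2 = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)
    return uv2 - torch.stack([xs, ys], dim=-1)

# Residual ("object") motion: what the measured optical flow explains beyond the flow
# that static geometry plus camera motion would predict.
depth = torch.ones(48, 64) * 5.0
K = torch.tensor([[50.0, 0.0, 32.0], [0.0, 50.0, 24.0], [0.0, 0.0, 1.0]])
R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])
optical_flow = torch.zeros(48, 64, 2)
object_flow = optical_flow - rigid_flow(depth, K, R, t)
```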