CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
We present a method for teaching machines to understand and model the
underlying spatial common sense of diverse human-object interactions in 3D in a
self-supervised way. This is a challenging task, as there exist specific
manifolds of the interactions that can be considered human-like and natural,
but the human pose and the geometry of objects can vary even for similar
interactions. Such diversity makes annotating 3D interactions
difficult and hard to scale, which limits the potential to reason about them in
a supervised way. One way of learning the 3D spatial relationship between
humans and objects during interaction is by showing multiple 2D images captured
from different viewpoints when humans interact with the same type of objects.
The core idea of our method is to leverage a generative model that produces
high-quality 2D images from an arbitrary text prompt input as an "unbounded"
data generator with effective controllability and view diversity. Although its
image quality falls short of real images, we demonstrate that the
synthesized images are sufficient to learn the 3D human-object spatial
relations. We present multiple strategies to leverage the synthesized images,
including (1) the first method to leverage a generative image model for 3D
human-object spatial relation learning; (2) a framework to reason about the 3D
spatial relations from inconsistent 2D cues in a self-supervised manner via 3D
occupancy reasoning with pose canonicalization; (3) semantic clustering to
disambiguate different types of interactions with the same object types; and
(4) a novel metric to assess the quality of 3D spatial learning of interaction.
Comment: Accepted to ICCV 2023 (Oral Presentation). Project Page:
https://jellyheadandrew.github.io/projects/choru
Self-Supervised Intrinsic Image Decomposition
Intrinsic decomposition from a single image is a highly challenging task, due
to its inherent ambiguity and the scarcity of training data. In contrast to
traditional fully supervised learning approaches, in this paper we propose
learning intrinsic image decomposition by explaining the input image. Our
model, the Rendered Intrinsics Network (RIN), joins together an image
decomposition pipeline, which predicts reflectance, shape, and lighting
conditions given a single image, with a recombination function, a learned
shading model used to recompose the original input based on the intrinsic image
predictions. Our network can then use unsupervised reconstruction error as an
additional signal to improve its intermediate representations. This allows
large-scale unlabeled data to be useful during training, and also enables
transferring learned knowledge to images of unseen object categories, lighting
conditions, and shapes. Extensive experiments demonstrate that our method
performs well on both intrinsic image decomposition and knowledge transfer.
Comment: NIPS 2017 camera-ready version, project page:
http://rin.csail.mit.edu
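The decompose-then-recombine idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes a learned shading model, whereas this sketch substitutes a fixed Lambertian shading rule as an assumption, and the function names (`lambertian_shading`, `recompose`, `reconstruction_error`) are hypothetical. It only shows how a reconstruction error can be computed from predicted intrinsics without any labels.

```python
import numpy as np

def lambertian_shading(normals, light):
    # ASSUMPTION: fixed Lambertian shading stands in for the learned shading model.
    # normals: (H, W, 3) unit surface normals; light: (3,) light direction.
    light = light / np.linalg.norm(light)
    return np.clip(normals @ light, 0.0, None)  # (H, W), negative values clamped

def recompose(reflectance, normals, light):
    # Recombine the predicted intrinsics into an image: reflectance * shading.
    shading = lambertian_shading(normals, light)
    return reflectance * shading[..., None]

def reconstruction_error(image, reflectance, normals, light):
    # Label-free signal: mean squared error between input and recomposed image.
    return float(np.mean((image - recompose(reflectance, normals, light)) ** 2))

# Toy scene: a flat surface facing the light, with uniform gray reflectance.
normals = np.zeros((4, 4, 3))
normals[..., 2] = 1.0
reflectance = np.full((4, 4, 3), 0.5)
light = np.array([0.0, 0.0, 1.0])
image = recompose(reflectance, normals, light)

good = reconstruction_error(image, reflectance, normals, light)
bad = reconstruction_error(image, np.full((4, 4, 3), 0.25), normals, light)
```

In this toy setup, correct intrinsics reproduce the input exactly, while a wrong reflectance estimate yields a positive error that could drive training on unlabeled images.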
Learning Shape Priors for Single-View 3D Completion and Reconstruction
The problem of single-view 3D shape completion or reconstruction is
challenging, because among the many possible shapes that explain an
observation, most are implausible and do not correspond to natural objects.
Recent research in the field has tackled this problem by exploiting the
expressiveness of deep convolutional networks. In fact, there is another level
of ambiguity that is often overlooked: among plausible shapes, there are still
multiple shapes that fit the 2D image equally well; i.e., the ground truth
shape is non-deterministic given a single-view input. Existing fully supervised
approaches fail to address this issue, and often produce blurry mean shapes
with smooth surfaces but no fine details.
In this paper, we propose ShapeHD, pushing the limit of single-view shape
completion and reconstruction by integrating deep generative models with
adversarially learned shape priors. The learned priors serve as a regularizer,
penalizing the model only if its output is unrealistic, not if it deviates from
the ground truth. Our design thus overcomes both of the aforementioned levels
of ambiguity. Experiments demonstrate that ShapeHD outperforms the state of the
art by a large margin in both shape completion and shape reconstruction on
multiple real datasets.
Comment: ECCV 2018. The first two authors contributed equally to this work.
Project page: http://shapehd.csail.mit.edu
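The contrast this abstract draws between a ground-truth loss and a realism-only penalty can be sketched as follows. This is an illustration under stated assumptions, not ShapeHD itself: the abstract describes an adversarially learned prior, whereas `toy_critic` below is a hand-written, hypothetical stand-in that simply scores crisp voxel grids above blurry ones.

```python
import numpy as np

def supervised_loss(pred, target):
    # Mean squared error against one ground-truth voxel grid.
    return float(np.mean((pred - target) ** 2))

def toy_critic(voxels):
    # ASSUMPTION: hand-written stand-in for a learned discriminator.
    # Scores 1.0 for fully binary occupancy, 0.0 for uniform 0.5 "blur".
    return float(np.mean(np.abs(2.0 * voxels - 1.0)))

def naturalness_penalty(voxels):
    # Penalizes unrealistic output only; no ground truth is involved.
    return -toy_critic(voxels)

# Two equally plausible ground-truth shapes for the same observation.
shape_a = np.zeros((4, 4, 4))
shape_a[:2] = 1.0
shape_b = np.zeros((4, 4, 4))
shape_b[2:] = 1.0
mean_shape = 0.5 * (shape_a + shape_b)  # the blurry average of the two

# Expected supervised loss when the target is shape_a or shape_b at random.
exp_sup_mean = 0.5 * (supervised_loss(mean_shape, shape_a)
                      + supervised_loss(mean_shape, shape_b))
exp_sup_crisp = 0.5 * (supervised_loss(shape_a, shape_a)
                       + supervised_loss(shape_a, shape_b))

pen_mean = naturalness_penalty(mean_shape)
pen_crisp = naturalness_penalty(shape_a)
```

In this toy case the blurry mean shape has the lower expected supervised loss, while the realism penalty favors the crisp shape, which is the tension the abstract describes.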
ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids
We introduce an unsupervised feature learning approach that embeds 3D shape
information into a single-view image representation. The main idea is a
self-supervised training objective that, given only a single 2D image, requires
all unseen views of the object to be predictable from learned features. We
implement this idea as an encoder-decoder convolutional neural network. The
network maps an input image of an unknown category and unknown viewpoint to a
latent space, from which a deconvolutional decoder can best "lift" the image to
its complete viewgrid showing the object from all viewing angles. Our
class-agnostic training procedure encourages the representation to capture
fundamental shape primitives and semantic regularities in a data-driven
manner---without manual semantic labels. Our results on two widely-used shape
datasets show 1) our approach successfully learns to perform "mental rotation"
even for objects unseen during training, and 2) the learned latent space is a
powerful representation for object recognition, outperforming several existing
unsupervised feature learning methods.
Comment: To appear at ECCV 201