3D Shape Understanding and Generation
In recent years, machine learning techniques have revolutionized solutions to longstanding image-based problems such as image classification, generation, semantic segmentation, and object detection, among many others. However, if we want to build agents that can successfully interact with the real world, those techniques need to be capable of reasoning about the world as it truly is: a three-dimensional space. There are two main challenges in handling 3D information in machine learning models. First, it is not clear what the best 3D representation is. For images, convolutional neural networks (CNNs) operating on raster images yield the best results in virtually all image-based benchmarks; for 3D data, the best combination of model and representation is still an open question. Second, 3D data is not available on the same scale as images: taking pictures is a common part of our daily lives, whereas capturing 3D content is usually restricted to specialized professionals. This thesis is focused on addressing both of these issues. Which model and representation should we use for generating and recognizing 3D data? What are efficient ways of learning 3D representations from a few examples? Is it possible to leverage image data to build models capable of reasoning about the world in 3D?
Our research findings show that it is possible to build models that efficiently generate 3D shapes as irregularly structured representations. These models require significantly less memory while generating higher-quality shapes than those based on voxels and multi-view representations. We start by developing techniques to generate shapes represented as point clouds. This class of models leads to high-quality reconstructions and better unsupervised feature learning. However, since point clouds are not amenable to editing and human manipulation, we also present models capable of generating shapes as sets of shape handles: simpler primitives that summarize complex 3D shapes and were specifically designed for high-level tasks and user interaction. Despite their effectiveness, those approaches require some form of 3D supervision, which is scarce. We present multiple alternatives to this problem. First, we investigate how approximate convex decomposition techniques can be used as self-supervision to improve recognition models when only a limited number of labels are available. Second, we study how neural network architectures induce shape priors that can be used in multiple reconstruction tasks, using both volumetric and manifold representations. In this regime, reconstruction is performed from a single example: either a sparse point cloud or multiple silhouettes. Finally, we demonstrate how to train generative models of 3D shapes without any 3D supervision by combining differentiable rendering techniques and Generative Adversarial Networks.
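To make the last idea concrete, below is a minimal sketch, assuming a toy voxel generator and an orthographic silhouette "renderer", of how a 3D generative model can be trained against 2D silhouettes only. All names, layer sizes, and the rendering operator are illustrative assumptions, not the thesis' actual architecture.

```python
# Hedged sketch: a GAN setup where only 2D silhouettes ever reach the loss.
import torch
import torch.nn as nn

class VoxelGenerator(nn.Module):
    def __init__(self, z_dim=64, res=32):
        super().__init__()
        self.res = res
        self.net = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, res ** 3), nn.Sigmoid(),   # occupancy in [0, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, self.res, self.res, self.res)

def render_silhouette(voxels, axis=1):
    """Differentiable orthographic silhouette: max occupancy along a view axis."""
    return voxels.max(dim=axis).values               # (B, res, res) image

discriminator = nn.Sequential(                       # judges 2D silhouettes only
    nn.Flatten(), nn.Linear(32 * 32, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1),
)

# One (untrained) forward pass: no 3D labels are used anywhere.
G = VoxelGenerator()
z = torch.randn(4, 64)
fake_sil = render_silhouette(G(z))
score = discriminator(fake_sil)
print(fake_sil.shape, score.shape)                   # (4, 32, 32), (4, 1)
```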
Shape Generation using Spatially Partitioned Point Clouds
We propose a method to generate 3D shapes using point clouds. Given a
point-cloud representation of a 3D shape, our method builds a kd-tree to
spatially partition the points. This orders them consistently, resulting in
reasonably good correspondences across shapes. We then use PCA to derive a
linear shape basis across the spatially partitioned points, and optimize the
point ordering by iteratively minimizing
the PCA reconstruction error. Even with the spatial sorting, the point clouds
are inherently noisy and the resulting distribution over the shape coefficients
can be highly multi-modal. We propose to use the expressive power of neural
networks to learn a distribution over the shape coefficients in a
generative-adversarial framework. Compared to 3D shape generative models
trained on voxel representations, our point-based method is considerably more
lightweight and scalable, with little loss of quality. It also outperforms
simpler linear factor models such as Probabilistic PCA, both qualitatively and
quantitatively, on a number of categories from the ShapeNet dataset.
Furthermore, our method can easily incorporate other point attributes such as
normal and color information, an additional advantage over voxel-based
representations.
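A minimal sketch of the ordering-plus-basis idea, assuming toy data: points in each cloud are ordered by recursive kd-tree median splits, and a linear shape basis is then fit with PCA over the flattened, consistently ordered clouds. The array sizes, the alternating split axis, and the use of scikit-learn are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np
from sklearn.decomposition import PCA

def kd_order(points, axis=0):
    """Reorder a (K, 3) point set by recursive median splits (kd-tree style)."""
    if len(points) <= 1:
        return points
    points = points[np.argsort(points[:, axis])]
    mid = len(points) // 2
    next_axis = (axis + 1) % points.shape[1]
    return np.concatenate([kd_order(points[:mid], next_axis),
                           kd_order(points[mid:], next_axis)], axis=0)

# Toy dataset: 64 random "shapes", each a cloud of 256 3D points.
rng = np.random.default_rng(0)
clouds = rng.normal(size=(64, 256, 3))

# Consistent ordering gives rough correspondences across shapes.
ordered = np.stack([kd_order(c) for c in clouds])    # (64, 256, 3)
flat = ordered.reshape(len(ordered), -1)             # (64, 768)

# Linear shape basis; a GAN would then model the coefficient distribution.
pca = PCA(n_components=16)
coeffs = pca.fit_transform(flat)                     # (64, 16)
recon = pca.inverse_transform(coeffs).reshape(ordered.shape)
print("mean reconstruction error:", np.abs(recon - ordered).mean())
```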
3D Shape Reconstruction from Sketches via Multi-view Convolutional Networks
We propose a method for reconstructing 3D shapes from 2D sketches in the form
of line drawings. Our method takes as input a single sketch, or multiple
sketches, and outputs a dense point cloud representing a 3D reconstruction of
the input sketch(es). The point cloud is then converted into a polygon mesh. At
the heart of our method lies a deep encoder-decoder network. The encoder
converts the sketch into a compact representation encoding shape information.
The decoder converts this representation into depth and normal maps capturing
the underlying surface from several output viewpoints. The multi-view maps are
then consolidated into a 3D point cloud by solving an optimization problem that
fuses depth and normals across all viewpoints. In our experiments, compared to
other methods such as volumetric networks, our architecture offers several
advantages, including more faithful reconstruction, higher output surface
resolution, and better preservation of topology and shape structure.
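A hedged PyTorch sketch of the encoder-decoder layout the abstract describes: a sketch image is encoded into a compact code, and a decoder predicts per-view depth and normal maps. The layer sizes, the number of views, and the upsampling scheme are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SketchToMultiView(nn.Module):
    def __init__(self, num_views=12, code_dim=256):
        super().__init__()
        self.num_views = num_views
        self.encoder = nn.Sequential(          # 1x128x128 sketch -> compact code
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, code_dim),
        )
        self.decoder = nn.Sequential(          # code -> num_views x (depth + normals)
            nn.Linear(code_dim, 128 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_views * 4, 4, stride=2, padding=1),
        )

    def forward(self, sketch):
        maps = self.decoder(self.encoder(sketch))
        maps = maps.view(-1, self.num_views, 4, 128, 128)
        depth, normals = maps[:, :, :1], maps[:, :, 1:]
        return depth, normals                  # 1-channel depth + 3-channel normals per view

model = SketchToMultiView()
depth, normals = model(torch.randn(2, 1, 128, 128))
print(depth.shape, normals.shape)              # (2, 12, 1, 128, 128), (2, 12, 3, 128, 128)
```

The downstream fusion of these maps into a point cloud is posed in the abstract as an optimization problem and is not sketched here.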
Accidental Turntables: Learning 3D Pose by Watching Objects Turn
We propose a technique for learning single-view 3D object pose estimation
models by utilizing a new source of data -- in-the-wild videos where objects
turn. Such videos are prevalent in practice (e.g., cars in roundabouts,
airplanes near runways) and easy to collect. We show that classical
structure-from-motion algorithms, coupled with recent advances in instance
detection and feature matching, provide surprisingly accurate relative 3D pose
estimation on such videos. We propose a multi-stage training scheme that first
learns a canonical pose across a collection of videos and then supervises a
model for single-view pose estimation. The proposed technique achieves
performance competitive with the existing state of the art on standard
benchmarks for 3D pose estimation, without requiring any pose labels during
training. We also contribute the Accidental Turntables Dataset, containing a
challenging set of 41,212 images of cars with cluttered backgrounds, motion
blur, and illumination changes, which serves as a benchmark for 3D pose
estimation. Project website: https://people.cs.umass.edu/~zezhoucheng/acci-turn
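A minimal sketch of the classical ingredient the abstract builds on: estimating the relative pose between two frames of a turning object from feature matches, here via SIFT and the essential matrix in OpenCV. The intrinsics, frame paths, and the omission of instance detection and the multi-stage canonical-pose alignment are simplifications for illustration.

```python
import cv2
import numpy as np

def relative_pose(frame_a, frame_b, K):
    """Estimate the relative rotation/translation between two grayscale frames."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)

    # Ratio-test matching of local features.
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # Essential matrix with RANSAC, then decompose into R, t.
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    return R, t

# Example usage (paths and intrinsics are hypothetical placeholders):
# K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
# a = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
# b = cv2.imread("frame_010.png", cv2.IMREAD_GRAYSCALE)
# R, t = relative_pose(a, b, K)
```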
Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
Diffusion Handles is a novel approach to enabling 3D object edits on
diffusion images. We accomplish these edits using existing pre-trained
diffusion models and 2D image depth estimation, without any fine-tuning or 3D
object retrieval. The edited results remain plausible, photo-real, and preserve
object identity. Diffusion Handles addresses a critically missing facet of
generative image-based creative design and significantly advances the state of
the art in generative image editing. Our key insight is to lift
diffusion activations for an object to 3D using a proxy depth, 3D-transform the
depth and associated activations, and project them back to image space. The
diffusion process, applied to the manipulated activations with identity control,
produces plausible edited images showing complex 3D occlusion and lighting
effects. We evaluate Diffusion Handles quantitatively on a large synthetic data
benchmark and qualitatively through a user study, showing that our output is
more plausible and better than prior art at both 3D editing and identity
control. Project Webpage: https://diffusionhandles.github.io/
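A hedged numpy sketch of the "lift, transform, project" step described above: pixels are unprojected to 3D with a proxy depth map, rigidly transformed, and splatted back to the image plane so that per-pixel activations move with the object. The intrinsics, the transform, and the nearest-pixel splatting are illustrative assumptions; the actual method operates on diffusion-model activations with identity control.

```python
import numpy as np

def warp_activations(act, depth, K, R, t):
    """Move a (C, H, W) activation map according to a rigid 3D edit of the proxy depth."""
    C, H, W = act.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Lift: unproject every pixel to a 3D point using the proxy depth.
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts = rays * depth.reshape(1, -1)

    # Transform: apply the user's 3D edit (rotation R, translation t).
    pts = R @ pts + t.reshape(3, 1)

    # Project: back to pixel coordinates in the edited image.
    proj = K @ pts
    pu = np.round(proj[0] / proj[2]).astype(int)
    pv = np.round(proj[1] / proj[2]).astype(int)

    # Splat activations to their new locations (nearest pixel, no z-buffer).
    out = np.zeros_like(act)
    valid = (pu >= 0) & (pu < W) & (pv >= 0) & (pv < H) & (proj[2] > 0)
    src = act.reshape(C, -1)
    out[:, pv[valid], pu[valid]] = src[:, valid]
    return out

# Toy example: translate a random activation map by 0.2 units along x.
K = np.array([[64.0, 0, 32], [0, 64.0, 32], [0, 0, 1]])
act = np.random.rand(8, 64, 64)
depth = np.full((64, 64), 2.0)
warped = warp_activations(act, depth, K, np.eye(3), np.array([0.2, 0.0, 0.0]))
print(warped.shape)
```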