An image of a scene can be described by the shape, pose and appearance of the objects
within it, as well as the illumination, and the camera that captured it. A fundamental
goal in computer vision is to recover such descriptions from an image. Such representations
can be useful for tasks such as autonomous robotic interaction with an
environment, but obtaining them can be very challenging due to the large variability of
objects present in natural scenes.
A long-standing approach in computer vision is to use generative models of images in
order to infer the descriptions that generated the image. These methods are referred to
as “vision as inverse graphics” or simply “inverse graphics”. We apply this approach
to scene understanding by using a generative model (GM) in the form of a graphics
renderer. Since searching over scene factors to obtain the best match for an image is
very inefficient, we use convolutional neural networks, which we refer to as the
recognition models (RM), trained on synthetic data to initialize the search.
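This initialize-then-refine strategy can be illustrated with a minimal sketch. Everything here is a stand-in: the linear map `render`, the pseudo-inverse "recognition model", and the squared-error loss are toy placeholders for the renderer, the trained CNN and the actual objective, not the thesis implementation.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])  # toy 'graphics' map

def render(z):
    """Stand-in for a graphics renderer: maps scene latents to an 'image'."""
    return A @ z

def recognition_model(image):
    """Stand-in RM: a fast approximate inverse giving an initial latent guess."""
    return np.linalg.pinv(A) @ image + 0.1  # deliberately imperfect init

def refine(image, z0, lr=0.05, steps=200):
    """Analysis-by-synthesis: gradient descent on the render-and-compare loss."""
    z = z0.copy()
    for _ in range(steps):
        grad = 2 * A.T @ (A @ z - image)  # d/dz of ||render(z) - image||^2
        z -= lr * grad
    return z

true_z = np.array([1.0, -0.5])
image = render(true_z)
z_init = recognition_model(image)   # fast but approximate
z_refined = refine(image, z_init)   # GM refinement improves the estimate
```

The RM does the expensive global search once (amortized, via training on synthetic data); the GM then only needs local refinement from that initialization.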
First we address the effect that object occlusions have on the performance of predictive
models of images. We propose an inverse graphics approach to predicting shape,
pose, appearance and illumination with a GM that includes an outlier model
to account for occlusions. We study how the inferences are affected by the degree
of occlusion of the foreground object, and show that this robust GM works
significantly better than a non-robust model. We then characterize the performance of the RM and the gains that can be made
by refining the search using the robust GM, using a new synthetic dataset that includes
background clutter and occlusions. We find that pose and shape are predicted very well
by the RM, but appearance and especially illumination less so. However, accuracy on
these latter two factors can be clearly improved with the generative model.
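One common way to realize such an outlier model, shown here purely as an illustration (the mixture weights, densities and pixel values are made up, not the thesis's fitted model), is a per-pixel mixture of a Gaussian around the GM's render and a uniform density that absorbs occluded pixels:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def robust_nll(obs, rendered, sigma=0.05, eps=0.2, u=1.0):
    """Negative log-likelihood under a robust per-pixel mixture:
    (1 - eps) * Gaussian(render) + eps * uniform outlier density u."""
    lik = (1 - eps) * gauss(obs, rendered, sigma) + eps * u
    return -np.sum(np.log(lik))

# A rendered foreground and an observation where an occluder covers 3 pixels.
rendered = np.full(10, 0.8)
obs = rendered.copy()
obs[:3] = 0.1  # occluded pixels disagree strongly with the render

nll_clean = robust_nll(rendered, rendered)
nll_occluded = robust_nll(obs, rendered)
# Nearly non-robust model (eps ~ 0): occluded pixels dominate the loss.
nll_nonrobust = robust_nll(obs, rendered, eps=1e-12)
```

Under the robust likelihood the occluded pixels incur a bounded penalty (the uniform component caps their cost), so the fit to the visible pixels is not swamped by the occluder.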
Next we apply our inverse graphics approach to scenes with multiple objects. We
propose a method to efficiently and differentiably model self-shadowing, which
improves the realism of the GM's renders. We also propose a way to render object occlusion
boundaries that results in more accurate gradients of the rendering function.
We evaluate these improvements using a dataset with multiple objects and show that
the refinement step of the GM clearly improves on the predictions of the RM for the
latent variables of shape, pose, appearance and illumination.
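To see why occlusion boundaries matter for gradients, consider this schematic sketch (not the thesis's actual formulation): a hard inside/outside test at a silhouette gives a pixel whose gradient with respect to the boundary position is zero almost everywhere, whereas a soft, sigmoid-weighted composite varies smoothly across the edge.

```python
import numpy as np

def soft_coverage(signed_dist, sharpness=20.0):
    """Sigmoid of the signed distance to the occluding silhouette: a soft,
    differentiable alternative to a hard inside/outside test."""
    return 1.0 / (1.0 + np.exp(-sharpness * signed_dist))

def composite(fg, bg, signed_dist):
    """Blend foreground over background by soft coverage."""
    a = soft_coverage(signed_dist)
    return a * fg + (1 - a) * bg

# Pixel values vary smoothly as the boundary sweeps past, so the rendering
# function has useful gradients near occlusion edges.
d = np.linspace(-0.2, 0.2, 5)
pix = composite(1.0, 0.0, d)
```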
Finally we tackle the task of learning generative models of 3D objects from a collection
of meshes. We present a latent variable architecture that learns to separately capture
the underlying factors of shape and appearance from the meshes. To do so we first
transform the meshes of a given class to a data representation that sidesteps the need
for landmark correspondences across meshes when learning the GM. The usefulness
of learning a disentangled latent representation of objects is demonstrated via an
experiment in which the appearance of one object is transferred onto the shape of
another.
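The transfer experiment relies only on the latent space being factored, which the following toy sketch makes concrete. The decoder and its weights are hypothetical stand-ins (in the thesis the decoder is learned from meshes): geometry reads only the shape code and colours read only the appearance code, so codes can be swapped freely.

```python
import numpy as np

# Hypothetical fixed decoder weights; in practice these would be learned.
W_SHAPE = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.3]])  # latent -> vertices
W_APP = np.array([[0.8, 0.1], [0.1, 0.9], [0.5, 0.5]])    # latent -> colours

def decode(z_shape, z_app):
    """Disentangled decoder: vertices depend only on z_shape,
    per-vertex colours only on z_app."""
    vertices = W_SHAPE @ z_shape
    colours = W_APP @ z_app
    return vertices, colours

z_shape_a, z_app_a = np.array([1.0, 0.0]), np.array([0.2, 0.7])
z_shape_b, z_app_b = np.array([0.0, 1.0]), np.array([0.9, 0.1])

# Appearance transfer: object A's appearance on object B's shape.
verts, cols = decode(z_shape_b, z_app_a)
```

Because the two factors never mix inside the decoder, the transferred object keeps B's geometry exactly while taking on A's colours.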