Learning Direct Optimization for scene understanding
We develop a Learning Direct Optimization (LiDO) method for the refinement of
a latent variable model that describes an input image x. Our goal is to explain a
single image x with an interpretable 3D computer graphics model having scene
graph latent variables z (such as object appearance, camera position). Given a
current estimate of z we can render a prediction of the image g(z), which can
be compared to the image x. The standard way to proceed is then to measure the
error E(x, g(z)) between the two, and use an optimizer to minimize the error.
However, it is unknown which error measure E would be most effective for
simultaneously addressing issues such as misaligned objects, occlusions,
textures, etc. In contrast, the LiDO approach trains a Prediction Network to
predict an update directly to correct z, rather than minimizing the error with
respect to z. Experiments show that our LiDO method converges rapidly as it
does not need to perform a search on the error landscape, produces better
solutions than error-based competitors, and is able to handle the mismatch
between the data and the fitted scene model. We apply LiDO to a realistic
synthetic dataset, and show that the method also transfers to work well with
real images.
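The core loop described above can be pictured as follows. This is a minimal sketch, not the paper's implementation: `render` stands in for the graphics function g(z), and `prediction_network` for the trained network that maps the observed image, the current render, and the current latents to an update of z.

```python
import numpy as np

def lido_refine(x, z, render, prediction_network, n_steps=5):
    """Sketch of LiDO refinement: iteratively correct latents z to explain image x.

    Unlike error-based optimization, no error metric E(x, g(z)) is minimized;
    the network predicts the update to z directly.
    """
    for _ in range(n_steps):
        r = render(z)                     # current render g(z)
        dz = prediction_network(x, r, z)  # predicted correction to z
        z = z + dz                        # apply the update directly
    return z
```

With stand-in components (e.g. an identity renderer and a network that predicts a fraction of the residual), the loop converges to latents whose render matches the image, without ever searching an error landscape.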
3D scene graph inference and refinement for vision-as-inverse-graphics
The goal of scene understanding is to interpret images,
so as to infer the objects present in a scene, their poses
and fine-grained details. This thesis focuses on methods that
can provide a much more detailed explanation of the scene than
standard bounding-boxes or pixel-level segmentation: we infer
the underlying 3D scene given only its
projection in the form of a single image.
We employ the Vision-as-Inverse-Graphics (VIG) paradigm,
which (a) infers the latent variables of a scene such
as the objects present and their properties as well as the lighting
and the camera, and (b) renders these
latent variables to reconstruct the input image.
One highly attractive aspect of the VIG approach is that it produces
a compact and interpretable representation of the 3D scene in
terms of an arbitrary number of objects, called a 'scene graph'.
This representation is of key importance, as it is useful if,
for example, we wish to edit, refine, or interpret
the scene, or to interact with it.
First, we investigate how recognition models can be used to infer
the scene graph given only a single RGB image. These models are
trained using realistic synthetic images and corresponding ground
truth scene graphs, obtained from a rich stochastic scene
generator. Once the objects have been detected, each object detection
is further processed using neural networks to predict
the object and global latent variables.
This allows the computation of object poses
and sizes in 3D scene coordinates, given the camera parameters. This
inference of the latent variables in the form of a 3D scene graph acts
like the encoder of an autoencoder, with graphics
rendering as the decoder.
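The inference stage described above can be sketched as a pipeline; all components here are hypothetical stand-ins for the trained networks, not the thesis's actual modules.

```python
# Sketch of scene-graph inference: detect objects, estimate global
# (camera) parameters, then predict per-object latent variables.
def infer_scene_graph(image, detector, lv_network, camera_estimator):
    detections = detector(image)            # 2D object detections
    camera = camera_estimator(detections)   # global camera parameters
    objects = [lv_network(image, d, camera) for d in detections]
    return {"camera": camera, "objects": objects}

# With a graphics renderer as the decoder, the pair behaves like an
# autoencoder:  reconstruction = render(infer_scene_graph(image, ...))
```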
One of the major challenges is the problem of placing the
detected objects in 3D at a reasonable size and distance with
respect to the single camera, the parameters of
which are unknown. Previous VIG approaches for
multiple objects usually only considered a fixed camera,
while we allow for variable camera pose. To infer the camera
parameters given the votes cast by the detected objects,
we introduce a Probabilistic HoughNets framework for combining
probabilistic votes, robustified with an outlier model.
Each detection provides one noisy low-dimensional manifold
in the Hough space, and by intersecting them
probabilistically we reduce the uncertainty on the camera parameters.
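The idea of intersecting probabilistic votes can be illustrated in one dimension. This is a simplified sketch, not the thesis's Probabilistic HoughNets: each detection casts a Gaussian vote over a camera parameter, the votes are fused by precision-weighted averaging, and a broad outlier component down-weights votes that disagree with a robust initial estimate.

```python
import numpy as np

def fuse_votes(means, variances, outlier_var=100.0, outlier_prob=0.1):
    """Fuse 1D Gaussian votes over a parameter, robustified against outliers."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    def normal_pdf(x, m, v):
        return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

    # robust initial estimate, so a gross outlier cannot dominate
    est = np.median(means)
    # responsibility that each vote is an inlier (two-component mixture)
    p_in = (1.0 - outlier_prob) * normal_pdf(est, means, variances)
    p_out = outlier_prob * normal_pdf(est, means, outlier_var)
    resp = p_in / (p_in + p_out)
    # precision-weighted fusion, with responsibilities as extra weights
    w = resp / variances
    fused_mean = np.sum(w * means) / np.sum(w)
    fused_var = 1.0 / np.sum(w)
    return fused_mean, fused_var
```

Fusing several agreeing votes shrinks the posterior variance below that of any single vote, which is the sense in which intersecting the votes reduces the uncertainty on the camera parameters.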
Given an initialization of a scene graph, its refinement typically
involves computationally expensive and inefficient
search through the latent space. Since optimization of the 3D scene
corresponding to an image is a challenging task even for a few latent
variables (LVs), previous work on multi-object scenes considered only
refinement of the geometry, but not the appearance or illumination. To overcome this
issue, we develop a framework called 'Learning Direct Optimization'
(LiDO) for optimization of the latent variables of a multi-object
scene. Instead of minimizing an error metric that compares the observed
image and the render, this optimization is driven by neural networks
that make use of the auto-context in the form of a current scene graph
and its render to predict the LV update.
Our experiments show that the LiDO method converges rapidly
as it does not need to perform a search on the error landscape,
produces better solutions than error-based competitors, and is able
to handle the mismatch between the data and the fitted scene model.
We apply LiDO to a realistic synthetic dataset, and show
that the method transfers to work well with real images.
The advantages of LiDO mean that it could be a critical component
in the development of future vision-as-inverse-graphics systems
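The auto-context input mentioned above can be sketched concretely: the prediction network sees both the observed image and the current render, for instance stacked along the channel axis. This is a plausible construction for illustration, not necessarily the thesis's exact one.

```python
import numpy as np

def make_autocontext_input(x, render):
    """Stack the observed image and the current render channel-wise."""
    assert x.shape == render.shape  # e.g. (H, W, 3) each
    return np.concatenate([x, render], axis=-1)  # (H, W, 6)
```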
Dataset for: Learning Direct Optimization for Scene Understanding
Description: The dataset consists of a large number of realistic synthetic images featuring objects of three classes (staplers, mugs and bananas) on a table-top, captured under a variety of lighting, viewpoint and object configuration conditions. In addition, the dataset includes a set of annotated real images that were manually taken to feature objects of the considered classes. It comprises over 22000 realistic synthetic images that can be used for training and testing, and 135 annotated real images for testing. All datasets include object annotations and their masks. Image resolution is 256 x 256. The synthetic datasets include all the latent variables of the 3D scene (the scene graph). The synthetic scenes were rendered using the Blender software (www.blender.org). For each object, the associated latent variables are its position, scaling factor, azimuthal rotation, shape (1-of-K encoding) and colour (RGB). The ground plane has a random RGB colour. The camera is placed at a random height above the origin, looking down at a random angle of elevation. The illumination model is uniform lighting plus a directional source (specified by the strength, azimuth and elevation of the source). Real dataset: for each object we annotated its class, instance mask, and contact point using the LabelMe software. This dataset is archived at DANS/EASY but is not accessible here; to view and access the files, follow the DOI link above.
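The latent variables listed in the description can be pictured as a per-scene record. The following is a hypothetical example for illustration only; the field names are not the dataset's actual schema.

```python
# Hypothetical scene-graph record for one synthetic scene, following the
# latent variables described above (illustrative field names).
scene_graph = {
    "objects": [
        {
            "class": "mug",                  # one of: stapler, mug, banana
            "position": [0.12, 0.00, 0.30],  # 3D position on the table-top
            "scale": 1.05,                   # scaling factor
            "azimuth": 42.0,                 # azimuthal rotation (degrees)
            "shape": [0, 1, 0],              # 1-of-K shape encoding
            "color": [0.8, 0.1, 0.1],        # RGB
        },
    ],
    "ground_color": [0.5, 0.45, 0.4],        # random RGB ground-plane colour
    "camera": {"height": 1.2, "elevation": 35.0},
    "light": {"strength": 0.7, "azimuth": 120.0, "elevation": 50.0},
}
```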