Reconstruction Bottlenecks in Object-Centric Generative Models
A range of methods with suitable inductive biases exist to learn
interpretable object-centric representations of images without supervision.
However, these are largely restricted to visually simple images; robust object
discovery in real-world sensory datasets remains elusive. To increase the
understanding of such inductive biases, we empirically investigate the role of
"reconstruction bottlenecks" for scene decomposition in GENESIS, a recent
VAE-based model. We show such bottlenecks determine reconstruction and
segmentation quality and critically influence model behaviour.
Comment: 10 pages, 7 figures, Workshop on Object-Oriented Learning at ICML 2020
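To make the notion of a reconstruction bottleneck concrete, the sketch below shows a toy Gaussian VAE in Python in which the latent dimensionality and the KL weight jointly act as the bottleneck; the names and architecture are illustrative assumptions, not GENESIS's actual components.

# Toy illustration (not GENESIS): a Gaussian VAE whose latent size and KL weight
# act as the "reconstruction bottleneck" controlling how much image detail survives.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, image_dim=64 * 64 * 3, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))  # mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, image_dim))

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(model, x, beta=1.0):
    # beta and latent_dim jointly set the bottleneck: a tighter bottleneck forces
    # coarser reconstructions, which in GENESIS-like models also shapes segmentation.
    recon, mu, logvar = model(x)
    rec = ((recon - x) ** 2).sum(dim=-1).mean()
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
    return rec + beta * kl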
Unsupervised Learning of Lagrangian Dynamics from Images for Prediction and Control
Recent approaches for modelling dynamics of physical systems with neural
networks enforce Lagrangian or Hamiltonian structure to improve prediction and
generalization. However, these approaches fail to handle the case when
coordinates are embedded in high-dimensional data such as images. We introduce
a new unsupervised neural network model that learns Lagrangian dynamics from
images, with interpretability that benefits prediction and control. The model
infers Lagrangian dynamics on generalized coordinates that are simultaneously
learned with a coordinate-aware variational autoencoder (VAE). The VAE is
designed to account for the geometry of physical systems composed of multiple
rigid bodies in the plane. By inferring interpretable Lagrangian dynamics, the
model learns physical system properties, such as kinetic and potential energy,
which enables long-term prediction of dynamics in the image space and synthesis
of energy-based controllers.
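For context, the structure such models enforce is the classical Lagrangian formulation on generalized coordinates q; in standard notation (not necessarily the paper's exact notation), with kinetic energy T and potential energy V:

L(q, \dot{q}) = T(q, \dot{q}) - V(q), \qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q} = 0,
\qquad T(q, \dot{q}) = \tfrac{1}{2}\,\dot{q}^{\top} M(q)\,\dot{q}

where M(q) is the configuration-dependent mass matrix of the planar rigid-body system. Learning the coordinates q together with T and V from pixels is what exposes kinetic and potential energy for long-term prediction and energy-based control.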
Unsupervised object-centric video generation and decomposition in 3D
A natural approach to generative modeling of videos is to represent them as a
composition of moving objects. Recent works model a set of 2D sprites over a
slowly-varying background, but without considering the underlying 3D scene that
gives rise to them. We instead propose to model a video as the view seen while
moving through a scene with multiple 3D objects and a 3D background. Our model
is trained from monocular videos without any supervision, yet learns to
generate coherent 3D scenes containing several moving objects. We conduct
detailed experiments on two datasets, going beyond the visual complexity
supported by state-of-the-art generative approaches. We evaluate our method on
depth-prediction and 3D object detection -- tasks which cannot be addressed by
those earlier works -- and show it outperforms them even on 2D instance
segmentation and tracking.
Comment: Appeared at NeurIPS 2020. Project page: http://pmh47.net/o3v
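As a rough illustration of "a video as a composition of moving objects", the Python sketch below shows only a depth-ordered compositing of per-object layers over a background; the inputs are hypothetical, whereas the actual model renders 3D objects and a 3D background from a moving viewpoint.

# Minimal sketch: composite per-object RGBA layers over a background by depth.
# Hypothetical inputs; the real model would render these from 3D object latents.
import numpy as np

def composite(background_rgb, layers):
    """background_rgb: (H, W, 3); layers: list of (depth, rgba) with rgba of shape (H, W, 4)."""
    frame = background_rgb.astype(np.float32).copy()
    # Paint far-to-near so nearer objects occlude farther ones.
    for _, rgba in sorted(layers, key=lambda t: -t[0]):
        alpha = rgba[..., 3:4]
        frame = alpha * rgba[..., :3] + (1.0 - alpha) * frame
    return frame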
RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces
We present RELATE, a model that learns to generate physically plausible
scenes and videos of multiple interacting objects. Similar to other generative
approaches, RELATE is trained end-to-end on raw, unlabeled data. RELATE
combines an object-centric GAN formulation with a model that explicitly
accounts for correlations between individual objects. This allows the model to
generate realistic scenes and videos from a physically-interpretable
parameterization. Furthermore, we show that modeling the object correlation is
necessary to learn to disentangle object positions and identity. We find that
RELATE is also amenable to physically realistic scene editing and that it
significantly outperforms prior art in object-centric scene generation in both
synthetic (CLEVR, ShapeStacks) and real-world data (cars). In addition, in
contrast to state-of-the-art methods in object-centric generative modeling,
RELATE also extends naturally to dynamic scenes and generates videos of high
visual fidelity. Source code, datasets and more results are available at
http://geometry.cs.ucl.ac.uk/projects/2020/relate/
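The key architectural idea, explicit pairwise correlation between object latents, can be sketched as follows; this is a generic interaction module written for illustration, not RELATE's published architecture.

# Illustrative sketch (not RELATE's actual model): per-object latents exchange
# pairwise messages so object positions/appearances become correlated before a
# GAN generator renders the scene from them.
import torch
import torch.nn as nn

class ObjectCorrelation(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.pairwise = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, z):                       # z: (batch, num_objects, dim)
        b, k, d = z.shape
        zi = z.unsqueeze(2).expand(b, k, k, d)  # receiver latents
        zj = z.unsqueeze(1).expand(b, k, k, d)  # sender latents
        messages = self.pairwise(torch.cat([zi, zj], dim=-1)).sum(dim=2)
        return z + messages                     # correlated object latents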
Unsupervised multi-object segmentation by predicting probable motion patterns
We propose a new approach to learn to segment multiple image objects without manual supervision. The method can extract objects from still images, but uses videos for supervision. While prior works have considered motion for segmentation, a key insight is that, while motion can be used to identify objects, not all objects are necessarily in motion: the absence of motion does not imply the absence of objects. Hence, our model learns to predict image regions that are likely to contain motion patterns characteristic of objects moving rigidly. It does not predict specific motion, which cannot be done unambiguously from a still image, but a distribution of possible motions, which includes the possibility that an object does not move at all. We demonstrate the advantage of this approach over its deterministic counterpart and show state-of-the-art unsupervised object segmentation performance on simulated and real-world benchmarks, surpassing methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment scenes, we also apply it to existing image reconstruction-based models, showing drastic improvements. Project page and code: https://www.robots.ox.ac.uk/~vgg/research/ppmp
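A minimal sketch of the probabilistic formulation described above, assuming the model predicts a per-pixel Gaussian over flow (the function and tensor names are illustrative, not the paper's):

# Sketch of the probabilistic idea: from a still image, predict a distribution
# over per-pixel motion rather than a single flow, and train with a Gaussian
# negative log-likelihood against the flow observed in the video.
import torch

def motion_nll(pred_mean, pred_logvar, observed_flow):
    """pred_mean, pred_logvar, observed_flow: (batch, 2, H, W).
    A large predicted variance lets the model assign probability to objects
    that could move but happen to be static in this particular video."""
    var = pred_logvar.exp()
    return (0.5 * ((observed_flow - pred_mean) ** 2 / var + pred_logvar)).mean()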
Active vision for robot manipulators using the free energy principle
Occlusions, restricted field of view and limited resolution all constrain a robot's ability to sense its environment from a single observation. In these cases, the robot first needs to actively query multiple observations and accumulate information before it can complete a task. In this paper, we cast this problem of active vision as active inference, which states that an intelligent agent maintains a generative model of its environment and acts in order to minimize its surprise, or expected free energy according to this model. We apply this to an object-reaching task for a 7-DOF robotic manipulator with an in-hand camera to scan the workspace. A novel generative model using deep neural networks is proposed that is able to fuse multiple views into an abstract representation and is trained from data by minimizing variational free energy. We validate our approach experimentally for a reaching task in simulation in which a robotic agent starts without any knowledge about its workspace. At each step, the next view pose is chosen by evaluating the expected free energy. We find that by minimizing the expected free energy, exploratory behavior emerges when the target object to reach is not in view, and the end effector is moved to the correct reach position once the target is located. Similar to an owl scavenging for prey, the robot naturally prefers higher ground for exploring, approaching its target once located.
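For reference, one common way to write the expected free energy that is minimized when scoring candidate view poses (not necessarily the exact formulation used here) is, for a policy \pi over future observations o and hidden states s:

G(\pi) = \mathbb{E}_{q(o, s \mid \pi)}\big[\ln q(s \mid \pi) - \ln p(o, s \mid \pi)\big]

This quantity decomposes into an information-seeking (epistemic) term and a goal-seeking (pragmatic) term; the former accounts for the exploratory scanning observed when the target is out of view, the latter for the reach once the target is located.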
Generalization and Robustness Implications in Object-Centric Learning
The idea behind object-centric representation learning is that natural scenes
can better be modeled as compositions of objects and their relations as opposed
to distributed representations. This inductive bias can be injected into neural
networks to potentially improve systematic generalization and learning
efficiency of downstream tasks in scenes with multiple objects. In this paper,
we train state-of-the-art unsupervised models on five common multi-object
datasets and evaluate segmentation accuracy and downstream object property
prediction. In addition, we study systematic generalization and robustness by
investigating the settings where either single objects are out-of-distribution
-- e.g., having unseen colors, textures, and shapes -- or global properties of
the scene are altered -- e.g., by occlusions, cropping, or increasing the
number of objects. From our experimental study, we find object-centric
representations to be generally useful for downstream tasks and robust to
shifts in the data distribution, especially if shifts affect single objects.
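Segmentation accuracy in studies of this kind is typically reported as the adjusted Rand index (ARI) over foreground pixels; the exact evaluation protocol is not given in the abstract, so the snippet below is only a minimal illustration of that common metric.

# Minimal sketch of the common evaluation: ARI between predicted and ground-truth
# object masks, computed over foreground pixels only (background excluded).
import numpy as np
from sklearn.metrics import adjusted_rand_score

def foreground_ari(true_ids, pred_ids, background_id=0):
    """true_ids, pred_ids: (H, W) integer object-id maps for one image."""
    fg = true_ids.flatten() != background_id
    return adjusted_rand_score(true_ids.flatten()[fg], pred_ids.flatten()[fg])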