Object-Centric Learning with Slot Attention
Learning object-centric representations of complex scenes is a promising step
towards enabling efficient abstract reasoning from low-level perceptual
features. Yet, most deep learning approaches learn distributed representations
that do not capture the compositional properties of natural scenes. In this
paper, we present the Slot Attention module, an architectural component that
interfaces with perceptual representations such as the output of a
convolutional neural network and produces a set of task-dependent abstract
representations which we call slots. These slots are exchangeable and can bind
to any object in the input by specializing through a competitive procedure over
multiple rounds of attention. We empirically demonstrate that Slot Attention
can extract object-centric representations that enable generalization to unseen
compositions when trained on unsupervised object discovery and supervised
property prediction tasks.
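The competitive binding procedure described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration only: the actual Slot Attention module adds learned query/key/value projections, layer normalization, and a GRU-plus-MLP slot update, all omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, dim=16, iters=3, seed=0):
    """Minimal Slot Attention sketch (no learned projections or GRU).

    inputs: (n_inputs, dim) perceptual features, e.g. flattened CNN output.
    Returns (num_slots, dim) slot representations.
    """
    rng = np.random.default_rng(seed)
    # Slots start as exchangeable samples from a shared distribution,
    # so any slot can bind to any object.
    slots = rng.normal(size=(num_slots, dim))
    for _ in range(iters):
        # Attention logits between every slot (query) and input (key).
        logits = slots @ inputs.T / np.sqrt(dim)   # (num_slots, n_inputs)
        # Softmax over the *slot* axis: slots compete for each input.
        attn = softmax(logits, axis=0)
        # Weighted mean of the inputs each slot has claimed.
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ inputs                   # real model: GRU + MLP update
    return slots
```

The softmax over slots (rather than over inputs) is what makes the procedure competitive: every input's attention mass must be divided among the slots, so slots specialize.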
Object-centric Learning with Cyclic Walks between Parts and Whole
Learning object-centric representations from complex natural environments
equips both humans and machines with the ability to reason from low-level
perceptual features. To capture compositional entities of the scene, we
propose cyclic walks between perceptual features extracted from CNNs or
transformers and object entities. First, a slot-attention module interfaces
with these perceptual features and produces a finite set of slot
representations. These slots can bind to any object entities in the scene via
inter-slot competition for attention. Next, we establish entity-feature
correspondence via cyclic walks along high-transition-probability paths, where
transition probabilities are based on pairwise similarity between perceptual
features (aka "parts") and slot-bound object representations (aka "wholes").
The whole is greater than its parts and
the parts constitute the whole. The part-whole interactions form cycle
consistencies, as supervisory signals, to train the slot-attention module. We
empirically demonstrate that the networks trained with our cyclic walks can
extract object-centric representations on seven image datasets in three
unsupervised learning tasks. In contrast to object-centric models that attach
a decoder for image or feature reconstruction, our cyclic walks provide strong
supervision signals while avoiding computational overhead and enhancing memory
efficiency.
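The part-whole cycle consistency described above can be sketched as follows. This is a plausible formalization under stated assumptions, not the paper's exact objective: the cross-entropy-against-identity loss, the temperature value, and the single round trip (part to whole to part) are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cyclic_walk_loss(features, slots, tau=0.1):
    """Cycle-consistency loss sketch for part-whole walks.

    features: (n, d) perceptual "parts"; slots: (k, d) slot-bound "wholes".
    A walk part -> whole -> part should return to its starting part, so the
    round-trip transition matrix is pushed toward the identity.
    """
    sim = features @ slots.T                  # (n, k) pairwise similarity
    p_fw = softmax(sim / tau, axis=1)         # part -> whole transition probs
    p_wf = softmax(sim.T / tau, axis=1)       # whole -> part transition probs
    roundtrip = p_fw @ p_wf                   # (n, n) part -> part round trip
    # Cross-entropy against the identity: each part should walk back to itself.
    return -np.log(np.diag(roundtrip) + 1e-8).mean()
```

Because the supervisory signal comes from these transition matrices alone, no pixel- or feature-space decoder is needed, which is where the claimed compute and memory savings come from.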
Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior
The aim of object-centric vision is to construct an explicit representation
of the objects in a scene. This representation is obtained via a set of
interchangeable modules called \emph{slots} or \emph{object files} that compete
for local patches of an image. The competition has a weak inductive bias to
preserve spatial continuity; consequently, one slot may claim patches scattered
diffusely throughout the image. In contrast, the inductive bias of human vision
is strong, to the degree that attention has classically been described with a
spotlight metaphor. We incorporate a spatial-locality prior into
state-of-the-art object-centric vision models and obtain significant
improvements in segmenting objects in both synthetic and real-world datasets.
Similar to human visual attention, the combination of image content and spatial
constraints yields robust unsupervised object-centric learning, including
reduced sensitivity to model hyperparameters.
Comment: 16 pages, 3 figures, under review at NeurIPS 202
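One simple way to realize the spotlight idea is to add a distance-based penalty to the slot-input attention logits. The sketch below is an assumption about how such a prior could be wired in; the squared-distance penalty, the `strength` weight, and the per-slot `centers` parameterization are illustrative, not the paper's exact formulation.

```python
import numpy as np

def spotlight_logits(slots, inputs, positions, centers, strength=1.0):
    """Attention logits with a spatial-locality ("spotlight") bias -- a sketch.

    slots: (k, d) slot vectors; inputs: (n, d) patch features;
    positions: (n, 2) normalized patch coordinates; centers: (k, 2)
    per-slot spotlight centers.
    """
    d = slots.shape[1]
    content = slots @ inputs.T / np.sqrt(d)            # (k, n) content match
    # Penalize patches far from each slot's spotlight center, encouraging
    # each slot to claim a spatially contiguous region instead of
    # diffusely scattered patches.
    dist2 = ((centers[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    return content - strength * dist2
```

With `strength=0` this reduces to purely content-based competition; increasing it trades content match against spatial continuity.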
Object-centric architectures enable efficient causal representation learning
Causal representation learning has shown a variety of settings in which we
can disentangle latent variables with identifiability guarantees (up to some
reasonable equivalence class). Common to all of these approaches is the
assumption that (1) the latent variables are represented as d-dimensional
vectors, and (2) that the observations are the output of some injective
generative function of these latent variables. While these assumptions appear
benign, we show that when the observations are of multiple objects, the
generative function is no longer injective and disentanglement fails in
practice. We can address this failure by combining recent developments in
object-centric learning and causal representation learning. By modifying the
Slot Attention architecture (arXiv:2006.15055), we develop an object-centric
architecture that leverages weak supervision from sparse perturbations to
disentangle each object's properties. This approach is more data-efficient in
the sense that it requires significantly fewer perturbations than a comparable
approach that encodes to a Euclidean space, and we show that it successfully
disentangles the properties of a set of objects in a series of simple
image-based disentanglement experiments.
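The weak supervision signal described here comes from data pairs that differ in a single, sparse change. A hypothetical generator for such pairs is sketched below; the flat per-object property matrix and the additive perturbation are assumptions made for illustration, not the paper's data pipeline.

```python
import numpy as np

def sparse_perturbation_pair(objects, rng):
    """Generate a weakly supervised training pair -- a sketch of the
    sparse-perturbation setup.

    objects: (k, p) matrix of per-object property vectors. The returned pair
    differs in exactly one property of one object; the index of that change
    is the only supervision available to the learner.
    """
    perturbed = objects.copy()
    obj = rng.integers(objects.shape[0])    # which object is perturbed
    prop = rng.integers(objects.shape[1])   # which of its properties changes
    perturbed[obj, prop] += rng.normal()    # perturb a single coordinate
    return objects, perturbed, (obj, prop)
```

The intuition for data efficiency: with one slot per object, a single perturbation constrains one slot's coordinates, whereas a monolithic Euclidean encoding spreads each object over the full latent vector and needs many more perturbations to pin properties down.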
Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction
When perceiving the world from multiple viewpoints, humans have the ability
to reason about the complete objects in a compositional manner even when an
object is completely occluded from certain viewpoints. Meanwhile, humans are
able to imagine novel views after observing multiple viewpoints. Recent
remarkable advances in multi-view object-centric learning still leave some
unresolved problems: 1) The shapes of partially or completely occluded objects
cannot be well reconstructed. 2) The novel viewpoint prediction depends on
expensive viewpoint annotations rather than implicit rules in view
representations. In this paper, we introduce a time-conditioned generative
model for videos. To reconstruct the complete shape of an object accurately, we
enhance the disentanglement between the latent representations of objects and
views, where the latent representations of time-conditioned views are jointly
inferred with a Transformer and then are input to a sequential extension of
Slot Attention to learn object-centric representations. In addition, Gaussian
processes are employed as priors of view latent variables for video generation
and novel-view prediction without viewpoint annotations. Experiments on
multiple datasets demonstrate that the proposed model can perform
object-centric video decomposition, reconstruct the complete shapes of
occluded objects, and make novel-view predictions.
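The Gaussian-process prior over view latents mentioned above can be illustrated by sampling a smooth latent trajectory over time. The RBF kernel, its length scale, and the latent dimensionality below are illustrative assumptions; the paper's kernel and inference scheme may differ.

```python
import numpy as np

def sample_view_latents(times, dim=4, length_scale=1.0, seed=0):
    """Sample view latents from a GP prior over time -- a sketch of using
    Gaussian processes as priors on view latent variables.

    times: sequence of timesteps. Returns a (T, dim) latent trajectory in
    which nearby timesteps get correlated (smoothly varying) view latents.
    """
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)[:, None]
    # RBF kernel: correlation decays with squared time difference.
    K = np.exp(-((t - t.T) ** 2) / (2 * length_scale ** 2))
    K += 1e-6 * np.eye(len(t))              # jitter for numerical stability
    L = np.linalg.cholesky(K)
    return L @ rng.normal(size=(len(t), dim))
```

Because the prior ties views at nearby times together, novel viewpoints can be predicted by sampling or interpolating the latent trajectory, without any explicit viewpoint annotation.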
Generalization and Robustness Implications in Object-Centric Learning
The idea behind object-centric representation learning is that natural scenes
can better be modeled as compositions of objects and their relations as opposed
to distributed representations. This inductive bias can be injected into neural
networks to potentially improve systematic generalization and learning
efficiency of downstream tasks in scenes with multiple objects. In this paper,
we train state-of-the-art unsupervised models on five common multi-object
datasets and evaluate segmentation accuracy and downstream object property
prediction. In addition, we study systematic generalization and robustness by
investigating the settings where either single objects are out-of-distribution
-- e.g., having unseen colors, textures, and shapes -- or global properties of
the scene are altered -- e.g., by occlusions, cropping, or increasing the
number of objects. From our experimental study, we find object-centric
representations to be generally useful for downstream tasks and robust to
shifts in the data distribution, especially when the shifts affect single
objects.
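Segmentation accuracy in studies like this one is commonly scored with the Adjusted Rand Index between predicted and ground-truth masks. A self-contained NumPy implementation of the standard ARI formula (equivalent in spirit to scikit-learn's `adjusted_rand_score`):

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two segmentations (flattened label maps).

    Returns 1.0 when the predicted masks match the ground-truth objects up
    to relabeling, and ~0.0 for chance-level agreement.
    """
    true = np.asarray(labels_true).ravel()
    pred = np.asarray(labels_pred).ravel()
    t_ids, t_inv = np.unique(true, return_inverse=True)
    p_ids, p_inv = np.unique(pred, return_inverse=True)
    # Contingency table counting co-occurring (true, pred) label pairs.
    table = np.zeros((len(t_ids), len(p_ids)))
    np.add.at(table, (t_inv, p_inv), 1)
    comb = lambda x: x * (x - 1) / 2.0      # pairs choose 2, elementwise
    sum_ij = comb(table).sum()
    a = comb(table.sum(axis=1)).sum()
    b = comb(table.sum(axis=0)).sum()
    n = comb(true.size)
    expected = a * b / n                    # chance-level agreement
    max_index = (a + b) / 2.0
    if max_index == expected:               # degenerate single-cluster case
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Because ARI is invariant to label permutation, it suits slot models, whose slots are exchangeable and carry no fixed object identity.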
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
Object-centric learning aims to represent visual data with a set of object
entities (a.k.a. slots), providing structured representations that enable
systematic generalization. Leveraging advanced architectures like Transformers,
recent approaches have made significant progress in unsupervised object
discovery. In addition, slot-based representations hold great potential for
generative modeling, such as controllable image generation and object
manipulation in image editing. However, current slot-based methods often
produce blurry images and distorted objects, exhibiting poor generative
modeling capabilities. In this paper, we focus on improving slot-to-image
decoding, a crucial aspect for high-quality visual generation. We introduce
SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for
both image and video data. Thanks to the powerful modeling capacity of LDMs,
SlotDiffusion surpasses previous slot models in unsupervised object
segmentation and visual generation across six datasets. Furthermore, our
learned object features can be utilized by existing object-centric dynamics
models, improving video prediction quality and downstream temporal reasoning
tasks. Finally, we demonstrate the scalability of SlotDiffusion to
unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated
with self-supervised pre-trained image encoders.
Comment: Project page: https://slotdiffusion.github.io/ . An earlier version
of this work appeared at the ICLR 2023 Workshop on Neurosymbolic Generative
Models: https://nesygems.github.io/assets/pdf/papers/SlotDiffusion.pd
Object-Centric Slot Diffusion
The recent success of transformer-based image generative models in
object-centric learning highlights the importance of powerful image generators
for handling complex scenes. However, despite the high expressiveness of
diffusion models in image generation, their integration into object-centric
learning remains largely unexplored. In this paper, we explore
the feasibility and potential of integrating diffusion models into
object-centric learning and investigate the pros and cons of this approach. We
introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes:
it is the first object-centric learning model to replace conventional slot
decoders with a latent diffusion model conditioned on object slots, and it is
also the first unsupervised compositional conditional diffusion model that
operates without the need for supervised annotations like text. Through
experiments on various object-centric tasks, including the first application of
the FFHQ dataset in this field, we demonstrate that LSD significantly
outperforms state-of-the-art transformer-based decoders, particularly in more
complex scenes, and exhibits superior unsupervised compositional generation
quality. Project page: https://latentslotdiffusion.github.io
Sensitivity of Slot-Based Object-Centric Models to their Number of Slots
Self-supervised methods for learning object-centric representations have
recently been applied successfully to various datasets. This progress is
largely fueled by slot-based methods, whose ability to cluster visual scenes
into meaningful objects holds great promise for compositional generalization
and downstream learning. In these methods, the number of slots (clusters)
is typically chosen to match the number of ground-truth objects in the data,
even though this quantity is unknown in real-world settings. Indeed, the
sensitivity of slot-based methods to , and how this affects their learned
correspondence to objects in the data has largely been ignored in the
literature. In this work, we address this issue through a systematic study of
slot-based methods. We propose using analogs to precision and recall based on
the Adjusted Rand Index to accurately quantify model behavior over a large
range of K. We find that, especially during training, incorrect choices of K
do not yield the desired object decomposition and, in fact, cause
substantial oversegmentation or merging of separate objects
(undersegmentation). We demonstrate that the choice of the objective function
and incorporating instance-level annotations can moderately mitigate this
behavior while still falling short of fully resolving this issue. Indeed, we
show how this issue persists across multiple methods and datasets and stress
its importance for future slot-based models.
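The two failure modes named above, oversegmentation and merging, can be detected with a simple overlap count per ground-truth object and per predicted segment. This sketch is a rough diagnostic in the same spirit as, but not identical to, the paper's ARI-based precision and recall analogs; the `min_overlap` threshold is an illustrative assumption.

```python
import numpy as np

def segmentation_diagnostics(true, pred, min_overlap=0.1):
    """Rough over-/under-segmentation diagnostic for slot masks.

    true, pred: flattened integer label maps. For each ground-truth object,
    count predicted segments covering at least `min_overlap` of its pixels:
    more than one signals oversegmentation. Symmetrically, a predicted
    segment substantially covering several objects signals merging.
    """
    true = np.asarray(true).ravel()
    pred = np.asarray(pred).ravel()
    over = 0
    for t in np.unique(true):
        mask = true == t
        segs, counts = np.unique(pred[mask], return_counts=True)
        over += int((counts / mask.sum() >= min_overlap).sum() > 1)
    under = 0
    for p in np.unique(pred):
        mask = pred == p
        objs, counts = np.unique(true[mask], return_counts=True)
        under += int((counts / mask.sum() >= min_overlap).sum() > 1)
    return {"oversegmented_objects": over, "merging_segments": under}
```

Intuitively, too many slots pushes the first count up (objects split across slots) and too few pushes the second up (slots forced to merge objects), which is the trade-off the study quantifies over a range of K.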