149 research outputs found
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
Object-centric learning aims to represent visual data with a set of object
entities (a.k.a. slots), providing structured representations that enable
systematic generalization. Leveraging advanced architectures like Transformers,
recent approaches have made significant progress in unsupervised object
discovery. In addition, slot-based representations hold great potential for
generative modeling, such as controllable image generation and object
manipulation in image editing. However, current slot-based methods often
produce blurry images and distorted objects, exhibiting poor generative
modeling capabilities. In this paper, we focus on improving slot-to-image
decoding, a crucial aspect for high-quality visual generation. We introduce
SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for
both image and video data. Thanks to the powerful modeling capacity of LDMs,
SlotDiffusion surpasses previous slot models in unsupervised object
segmentation and visual generation across six datasets. Furthermore, our
learned object features can be utilized by existing object-centric dynamics
models, improving video prediction quality and downstream temporal reasoning
tasks. Finally, we demonstrate the scalability of SlotDiffusion to
unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated
with self-supervised pre-trained image encoders.Comment: Project page: https://slotdiffusion.github.io/ . An earlier version
of this work appeared at the ICLR 2023 Workshop on Neurosymbolic Generative
Models: https://nesygems.github.io/assets/pdf/papers/SlotDiffusion.pd
Geometric deep learning: going beyond Euclidean data
Many scientific fields study data with an underlying structure that is a
non-Euclidean space. Some examples include social networks in computational
social sciences, sensor networks in communications, functional networks in
brain imaging, regulatory networks in genetics, and meshed surfaces in computer
graphics. In many applications, such geometric data are large and complex (in
the case of social networks, on the scale of billions), and are natural targets
for machine learning techniques. In particular, we would like to use deep
neural networks, which have recently proven to be powerful tools for a broad
range of problems from computer vision, natural language processing, and audio
analysis. However, these tools have been most successful on data with an
underlying Euclidean or grid-like structure, and in cases where the invariances
of these structures are built into networks used to model them. Geometric deep
learning is an umbrella term for emerging techniques attempting to generalize
(structured) deep neural models to non-Euclidean domains such as graphs and
manifolds. The purpose of this paper is to overview different examples of
geometric deep learning problems and present available solutions, key
difficulties, applications, and future research directions in this nascent
field
Pose Modulated Avatars from Video
It is now possible to reconstruct dynamic human motion and shape from a
sparse set of cameras using Neural Radiance Fields (NeRF) driven by an
underlying skeleton. However, a challenge remains to model the deformation of
cloth and skin in relation to skeleton pose. Unlike existing avatar models that
are learned implicitly or rely on a proxy surface, our approach is motivated by
the observation that different poses necessitate unique frequency assignments.
Neglecting this distinction yields noisy artifacts in smooth areas or blurs
fine-grained texture and shape details in sharp regions. We develop a
two-branch neural network that is adaptive and explicit in the frequency
domain. The first branch is a graph neural network that models correlations
among body parts locally, taking skeleton pose as input. The second branch
combines these correlation features to a set of global frequencies and then
modulates the feature encoding. Our experiments demonstrate that our network
outperforms state-of-the-art methods in terms of preserving details and
generalization capabilities
Compositional Scene Modeling with Global Object-Centric Representations
The appearance of the same object may vary in different scene images due to
perspectives and occlusions between objects. Humans can easily identify the
same object, even if occlusions exist, by completing the occluded parts based
on its canonical image in the memory. Achieving this ability is still a
challenge for machine learning, especially under the unsupervised learning
setting. Inspired by such an ability of humans, this paper proposes a
compositional scene modeling method to infer global representations of
canonical images of objects without any supervision. The representation of each
object is divided into an intrinsic part, which characterizes globally
invariant information (i.e. canonical representation of an object), and an
extrinsic part, which characterizes scene-dependent information (e.g., position
and size). To infer the intrinsic representation of each object, we employ a
patch-matching strategy to align the representation of a potentially occluded
object with the canonical representations of objects, and sample the most
probable canonical representation based on the category of object determined by
amortized variational inference. Extensive experiments are conducted on four
object-centric learning benchmarks, and experimental results demonstrate that
the proposed method not only outperforms state-of-the-arts in terms of
segmentation and reconstruction, but also achieves good global object
identification performance
Advancing Perception in Artificial Intelligence through Principles of Cognitive Science
Although artificial intelligence (AI) has achieved many feats at a rapid
pace, there still exist open problems and fundamental shortcomings related to
performance and resource efficiency. Since AI researchers benchmark a
significant proportion of performance standards through human intelligence,
cognitive sciences-inspired AI is a promising domain of research. Studying
cognitive science can provide a fresh perspective to building fundamental
blocks in AI research, which can lead to improved performance and efficiency.
In this review paper, we focus on the cognitive functions of perception, which
is the process of taking signals from one's surroundings as input, and
processing them to understand the environment. Particularly, we study and
compare its various processes through the lens of both cognitive sciences and
AI. Through this study, we review all current major theories from various
sub-disciplines of cognitive science (specifically neuroscience, psychology and
linguistics), and draw parallels with theories and techniques from current
practices in AI. We, hence, present a detailed collection of methods in AI for
researchers to build AI systems inspired by cognitive science. Further, through
the process of reviewing the state of cognitive-inspired AI, we point out many
gaps in the current state of AI (with respect to the performance of the human
brain), and hence present potential directions for researchers to develop
better perception systems in AI.Comment: Summary: a detailed review of the current state of perception models
through the lens of cognitive A
DNeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video
Given a monocular video, segmenting and decoupling dynamic objects while
recovering the static environment is a widely studied problem in machine
intelligence. Existing solutions usually approach this problem in the image
domain, limiting their performance and understanding of the environment. We
introduce Decoupled Dynamic Neural Radiance Field (DNeRF), a
self-supervised approach that takes a monocular video and learns a 3D scene
representation which decouples moving objects, including their shadows, from
the static background. Our method represents the moving objects and the static
background by two separate neural radiance fields with only one allowing for
temporal changes. A naive implementation of this approach leads to the dynamic
component taking over the static one as the representation of the former is
inherently more general and prone to overfitting. To this end, we propose a
novel loss to promote correct separation of phenomena. We further propose a
shadow field network to detect and decouple dynamically moving shadows. We
introduce a new dataset containing various dynamic objects and shadows and
demonstrate that our method can achieve better performance than
state-of-the-art approaches in decoupling dynamic and static 3D objects,
occlusion and shadow removal, and image segmentation for moving objects
- …