
    Representing an Object by Interchanging What with Where

    Exploring representations is a fundamental step towards understanding vision. The visual system carries two types of information along separate pathways: one about what an object is and the other about where it is. Initially, the what is represented by a pattern of activity distributed across millions of photoreceptors, whereas the where is only implicitly given by their retinotopic positions. Many computational theories of object recognition rely on such pixel-based representations, but these are insufficient for learning spatial information such as position and size because the where information is encoded only implicitly.
Here we transform the retinal image of an object into an internal image by interchanging the what with the where, meaning that patterns of intensity in the internal image describe the spatial information rather than the object information. Concretely, the retinal image of an object is deformed and converted into a negative image, in which light areas appear dark and vice versa, and the object's spatial information is quantified as levels of intensity on the borders of that image.
Interestingly, the inner part of the internal image, excluding the borders, shows position and scale invariance. To further understand how the internal image associates the what with the where, we examined the internal image of a face that moves or is scaled on the retina. We found that the internal images form a linear vector space under object translation and scaling.
In conclusion, these results suggest that what-where interchangeability might play an important role in organizing the two streams into the brain's internal representation.
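To make the interchange concrete, here is a minimal toy sketch (not the authors' implementation; the border-encoding scheme, the omission of the deformation step, and all names are assumptions): invert the intensities of a retinal patch and write its normalized position and scale as intensity levels on the border, so that two internal images of the same pattern at different positions differ only on their borders.

```python
# Toy sketch (not the authors' code): an "internal image" that inverts
# intensities (the "what") and writes the object's position and scale
# (the "where") as intensity levels along the image border.
import numpy as np

def internal_image(retina: np.ndarray, cx: float, cy: float, scale: float) -> np.ndarray:
    """retina: grayscale image in [0, 1]; (cx, cy, scale) in [0, 1] are the
    object's normalized position and size (assumed known here)."""
    img = 1.0 - retina.copy()          # negative image: light <-> dark
    img[0, :]  = cx                    # top border encodes horizontal position
    img[-1, :] = cy                    # bottom border encodes vertical position
    img[:, 0]  = scale                 # left border encodes scale
    img[:, -1] = scale                 # right border (redundant copy)
    return img

# Two internal images of the same pattern at different positions differ only on
# their borders; their inner parts are identical, which mirrors the position and
# scale invariance described in the abstract.
rng = np.random.default_rng(0)
patch = rng.random((32, 32))
a = internal_image(patch, cx=0.2, cy=0.3, scale=0.5)
b = internal_image(patch, cx=0.8, cy=0.6, scale=0.5)
assert np.allclose(a[1:-1, 1:-1], b[1:-1, 1:-1])
```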

    SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

    Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
Comment: Project page: https://slotdiffusion.github.io/. An earlier version of this work appeared at the ICLR 2023 Workshop on Neurosymbolic Generative Models: https://nesygems.github.io/assets/pdf/papers/SlotDiffusion.pd
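As a rough sketch of slot-conditioned latent diffusion decoding (a minimal, assumption-laden stand-in, not the released SlotDiffusion architecture, VAE, or noise schedule): slots from an object-centric encoder condition a latent-space denoiser through cross-attention, and the denoiser is trained to predict the noise added to the image latents.

```python
# Minimal sketch only: slot tokens condition a latent denoiser via cross-attention.
import torch
import torch.nn as nn

class SlotConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, slot_dim=64, n_heads=4):
        super().__init__()
        self.time_embed = nn.Linear(1, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads,
                                                kdim=slot_dim, vdim=slot_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                                 nn.Linear(latent_dim, latent_dim))

    def forward(self, z_t, t, slots):
        # z_t: (B, N, latent_dim) noisy latents flattened to tokens
        # t:   (B, 1) diffusion timestep, slots: (B, K, slot_dim)
        h = z_t + self.time_embed(t).unsqueeze(1)
        attn, _ = self.cross_attn(h, slots, slots)   # inject object information
        return self.mlp(h + attn)                    # predicted noise

# DDPM-style training target with a toy schedule: predict the added noise.
B, N, K = 2, 16, 5
z0, noise, t = torch.randn(B, N, 64), torch.randn(B, N, 64), torch.rand(B, 1)
z_t = (1 - t).unsqueeze(-1).sqrt() * z0 + t.unsqueeze(-1).sqrt() * noise
model = SlotConditionedDenoiser()
loss = nn.functional.mse_loss(model(z_t, t, torch.randn(B, K, 64)), noise)
```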

    Graphical Object-Centric Actor-Critic

    There have recently been significant advances in unsupervised object-centric representation learning and its application to downstream tasks. The latest works support the argument that employing disentangled object representations in image-based object-centric reinforcement learning tasks facilitates policy learning. We propose a novel object-centric reinforcement learning algorithm that combines actor-critic and model-based approaches to utilize these representations effectively. In our approach, we use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of the environment. The proposed method fills a research gap in developing efficient object-centric world models for reinforcement learning settings with discrete or continuous action spaces. In a visually complex 3D robotic environment and a 2D environment with compositional structure, our algorithm outperforms a state-of-the-art model-free actor-critic algorithm built on a transformer architecture as well as a state-of-the-art monolithic model-based algorithm.
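A hedged sketch of the ingredients named above, with the layer sizes, the fully connected object graph, and the pooled readout all being assumptions rather than the paper's design: slot features act as graph nodes, one message-passing step serves as the dynamics model, and actor-critic heads read a permutation-invariant summary of the slots.

```python
# Illustrative sketch, not the paper's implementation.
import torch
import torch.nn as nn

class GNNDynamics(nn.Module):
    def __init__(self, slot_dim=32, act_dim=4):
        super().__init__()
        self.edge = nn.Sequential(nn.Linear(2 * slot_dim, slot_dim), nn.ReLU())
        self.node = nn.Linear(2 * slot_dim + act_dim, slot_dim)

    def forward(self, slots, action):
        # slots: (B, K, D), action: (B, A); fully connected object graph
        B, K, D = slots.shape
        src = slots.unsqueeze(2).expand(B, K, K, D)
        dst = slots.unsqueeze(1).expand(B, K, K, D)
        msg = self.edge(torch.cat([src, dst], -1)).sum(2)          # aggregate messages
        a = action.unsqueeze(1).expand(B, K, -1)
        return slots + self.node(torch.cat([slots, msg, a], -1))   # predicted next slots

class ActorCritic(nn.Module):
    def __init__(self, slot_dim=32, act_dim=4):
        super().__init__()
        self.actor, self.critic = nn.Linear(slot_dim, act_dim), nn.Linear(slot_dim, 1)

    def forward(self, slots):
        pooled = slots.mean(1)                  # permutation-invariant readout
        return self.actor(pooled), self.critic(pooled)

slots = torch.randn(2, 5, 32)                   # e.g. from a transformer slot encoder
action = torch.randn(2, 4)
next_slots = GNNDynamics()(slots, action)
logits, value = ActorCritic()(slots)
```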

    Object tracking and matting for a class of dynamic image-based representations

    Image-based rendering (IBR) is an emerging technology for photo-realistic rendering of scenes from a collection of densely sampled images and videos. Recently, an object-based approach was proposed for a class of dynamic image-based representations called plenoptic videos. This paper proposes an automatic object-tracking approach using the level-set method. Our tracking method, which utilizes both local and global features of the image sequences rather than only the global features exploited in the previous approach, achieves better tracking results, especially for objects with non-uniform energy distribution. Because of possible segmentation errors around object boundaries, natural matting with a Bayesian approach is also incorporated into our system. Furthermore, an MPEG-4-like object-based algorithm is developed for compressing the plenoptic videos, which consist of the alpha maps, depth maps, and textures of the segmented image-based objects from the different plenoptic video streams. Experimental results show that satisfactory renderings can be obtained by the proposed approaches. © 2005 IEEE.
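For illustration only, the following sketch performs a single explicit level-set update that mixes a global region term (Chan-Vese-style mean separation) with a crude local term built from windowed statistics; the paper's actual energy functional, the Bayesian matting step, and the compression algorithm are not reproduced here.

```python
# Toy level-set evolution combining a global region term with a local term.
import numpy as np
from scipy.ndimage import uniform_filter

def level_set_step(phi, image, dt=0.1, lam_global=1.0, lam_local=0.5, win=7):
    """One explicit update of the level-set function phi (object = {phi > 0})."""
    inside = phi > 0
    c_in, c_out = image[inside].mean(), image[~inside].mean()
    # Global region term: pixels closer to the inside mean push phi up.
    global_force = (image - c_out) ** 2 - (image - c_in) ** 2
    # Crude local term: pixels brighter than their neighbourhood push phi up.
    local_force = image - uniform_filter(image, size=win)
    return phi + dt * (lam_global * global_force + lam_local * local_force)

# Toy usage: a bright square on a dark background, circle-initialised phi.
img = np.zeros((64, 64)); img[20:44, 20:44] = 1.0
yy, xx = np.mgrid[:64, :64]
phi = 20.0 - np.hypot(yy - 32, xx - 32)   # positive inside an initial circle
for _ in range(50):
    phi = level_set_step(phi, img)
mask = phi > 0                            # recovered object region
```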

    Light Field Morphable Models

    Statistical shape and texture appearance models are powerful image representations, but they had previously been restricted to 2D or simple 3D shapes. In this paper we present a novel 3D morphable model based on image-based rendering techniques, which can represent complex lighting conditions, structures, and surfaces. We describe how to construct a manifold of the multi-view appearance of an object class using light fields, and show how to match a 2D image of an object to a point on this manifold. In turn, we use the reconstructed light field to render novel views of the object. Our technique overcomes the limitations of polygon-based appearance models and uses light fields that are acquired in real time.
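A toy sketch of the linear-manifold idea under stated assumptions (random arrays stand in for captured light fields, and PCA plays the role of the statistical model): a single observed 2D view is matched by least squares against the basis rows corresponding to that view, and the recovered coefficients reconstruct the full light field so a novel view can be rendered.

```python
# Illustrative linear light-field model, not the authors' pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_objects, n_views, n_pix = 20, 8, 16 * 16       # toy sizes
light_fields = rng.random((n_objects, n_views * n_pix))

mean = light_fields.mean(0)
U, S, Vt = np.linalg.svd(light_fields - mean, full_matrices=False)
basis = Vt[:5]                                    # top principal components

# Match a single observed view (view index 2) to a point on the manifold.
view = 2
rows = slice(view * n_pix, (view + 1) * n_pix)
observed = light_fields[0].reshape(n_views, n_pix)[view]
coeffs, *_ = np.linalg.lstsq(basis[:, rows].T, observed - mean[rows], rcond=None)

# Reconstruct the whole light field and render a different (novel) view.
recon = mean + coeffs @ basis
novel_view = recon.reshape(n_views, n_pix)[5].reshape(16, 16)
```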