Learning object-centric representations
Whenever an agent interacts with its environment, it has to take into account, and interact with, the objects present in that environment.
Yet most machine learning solutions either treat objects only implicitly or rely on highly engineered pipelines that account for objects through separate object-detection algorithms.
In this thesis, we explore supervised and unsupervised methods for learning object-centric representations from vision.
We focus on end-to-end learning, where information about objects can be extracted directly from images, and where every object can be separately described by a single vector-valued variable. Specifically, we present three novel methods:
• HART and MOHART, which track single and multiple objects in video, respectively, using RNNs with a hierarchy of differentiable attention mechanisms. These algorithms learn to anticipate future appearance changes and movement of the tracked objects, thereby learning representations that describe every tracked object separately.
• SQAIR, a VAE-based generative model of moving objects, which explicitly models the appearance of new objects in the scene and the disappearance of existing ones. It models every object with a separate latent variable, and disentangles the appearance, position and scale of each object. Posterior inference in this model allows for unsupervised object detection and tracking.
• SCAE, an unsupervised autoencoder with built-in knowledge of two-dimensional geometry and object-part decomposition, based on capsule networks. It learns to discover the parts present in an image and to group those parts into objects. Each object is modelled by a separate object capsule, whose activation probability is highly correlated with the object class, which allows for state-of-the-art unsupervised image classification.
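The hierarchy of differentiable attention in HART starts from glimpse extraction: reading a fixed-size window from the image through a bank of interpolation filters, so that the window's location and size stay differentiable. Below is a minimal NumPy sketch of one such mechanism, the Gaussian filter-bank attention popularized by DRAW and used in HART-style trackers. Function and parameter names are my own illustrative choices, not identifiers from the thesis:

```python
import numpy as np

def gaussian_glimpse(image, center, size, glimpse_shape):
    """Extract a (gh, gw) glimpse from a 2D image via Gaussian filter banks.

    center, size: (y, x) attention-box centre and extent in image pixels.
    Because the output is a smooth function of center/size, gradients can
    flow back into the attention parameters during training.
    """
    H, W = image.shape
    gh, gw = glimpse_shape
    cy, cx = center
    sy, sx = size
    # Filter means: evenly spaced sample points spanning the attended box.
    mu_y = cy + (np.arange(gh) - gh / 2 + 0.5) * (sy / gh)
    mu_x = cx + (np.arange(gw) - gw / 2 + 0.5) * (sx / gw)
    sigma_y, sigma_x = sy / gh, sx / gw
    # One Gaussian filter per output row/column, normalized to sum to 1.
    Fy = np.exp(-((np.arange(H)[None, :] - mu_y[:, None]) ** 2) / (2 * sigma_y ** 2))
    Fx = np.exp(-((np.arange(W)[None, :] - mu_x[:, None]) ** 2) / (2 * sigma_x ** 2))
    Fy /= Fy.sum(axis=1, keepdims=True) + 1e-8
    Fx /= Fx.sum(axis=1, keepdims=True) + 1e-8
    return Fy @ image @ Fx.T  # shape (gh, gw)
```

Because each filter row is normalized, a constant image yields a constant glimpse, and the glimpse resolution is decoupled from the size of the attended region.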
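SQAIR's generative side composes each frame from per-object latents. The toy sketch below illustrates that composition step under the assumption that every object contributes an appearance patch, a position, a scale, and a binary presence variable; a crude nearest-neighbour resize stands in for the differentiable spatial transformer the real model uses, and all names are hypothetical:

```python
import numpy as np

def compose_scene(appearances, positions, scales, presences, canvas_shape=(32, 32)):
    """Render a scene as the sum of present objects pasted onto a blank canvas.

    Each object is described separately (appearance patch, (y, x) position,
    scale, presence bit), mirroring SQAIR's disentangled per-object latents;
    absent objects (presence = False) simply do not render, which is how
    appearance/disappearance of objects is expressed.
    """
    canvas = np.zeros(canvas_shape)
    for patch, (py, px), s, present in zip(appearances, positions, scales, presences):
        if not present:
            continue
        ph, pw = patch.shape
        oh, ow = int(ph * s), int(pw * s)
        # Nearest-neighbour resize of the appearance patch by factor s.
        ys = (np.arange(oh) / s).astype(int).clip(0, ph - 1)
        xs = (np.arange(ow) / s).astype(int).clip(0, pw - 1)
        resized = patch[np.ix_(ys, xs)]
        y0, x0 = int(py), int(px)
        # Paste, clipping at the canvas border.
        canvas[y0:y0 + oh, x0:x0 + ow] += resized[:canvas_shape[0] - y0,
                                                  :canvas_shape[1] - x0]
    return canvas
```

In the actual model the patch, position, scale, and presence would each be decoded from an object's latent variable rather than passed in directly.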
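The claim that object-capsule activations correlate with class can be turned into an unsupervised classifier by treating the most active capsule as a cluster id and mapping each cluster to its majority ground-truth label, a standard evaluation protocol for unsupervised classification. A minimal sketch of that evaluation (assumed, not taken from the SCAE code):

```python
import numpy as np

def cluster_accuracy(capsule_probs, labels):
    """Unsupervised classification accuracy from capsule activations.

    capsule_probs: (n_samples, n_capsules) activation probabilities.
    labels:        (n_samples,) integer ground-truth classes.
    Each sample is assigned to its most active capsule; each capsule's
    cluster is then mapped to its majority label.
    """
    clusters = capsule_probs.argmax(axis=1)
    correct = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        correct += np.bincount(members).max()  # majority-vote mapping
    return correct / len(labels)
```

If capsule activations track object class well, this score approaches supervised accuracy without ever using labels during training.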
Deep Attention Models for Human Tracking Using RGBD
Visual tracking performance has long been limited by the lack of good appearance models. Such models fail either when the object's appearance changes rapidly, as in motion-based tracking, or when accurate appearance information is unavailable, as in colour camouflage (where background and foreground colours are similar). This paper proposes a robust, adaptive appearance model which works accurately under colour camouflage, even in the presence of complex natural objects. The proposed model includes depth as an additional feature in a hierarchical modular neural framework for online object tracking. The model adapts to confusing appearance by exploiting the stable difference in depth between the target and the surrounding object(s). Depth complements the existing RGB features in scenarios where the RGB features fail to adapt and hence become unstable over long durations. The parameters of the model are learned efficiently in a deep network consisting of three modules: (1) the spatial attention layer, which discards the majority of the background by selecting a region containing the object of interest; (2) the appearance attention layer, which extracts appearance and spatial information about the tracked object; and (3) the state estimation layer, which enables the framework to predict future object appearance and location. Three different models were trained and tested to analyze the effect of depth alongside RGB information. In addition, a model is proposed that uses depth alone as a standalone input for tracking. The proposed models were also evaluated in real time using KinectV2 and showed very promising results. The results of our proposed network structures and their comparison with the state-of-the-art RGB tracking model demonstrate that adding depth significantly improves tracking accuracy in more challenging (i.e., cluttered and camouflaged) environments.
Furthermore, the results of the depth-based models showed that depth data can provide enough information for accurate tracking, even without RGB information.
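The three-module pipeline described above can be summarized as one tracking step: crop a search region around the previous box, summarize RGB+depth features of the crop, and estimate the new state. The sketch below uses trivial stand-ins (a fixed crop, per-channel means, and a depth-peak recentring rule) in place of the trained attention and estimation networks; every name here is my own illustration, not the paper's API:

```python
import numpy as np

def track_step(rgbd_frame, prev_box):
    """One illustrative tracking step over an (H, W, 4) RGB-D frame.

    prev_box: (y, x, h, w) of the object in the previous frame.
    (1) Spatial attention: keep only a region around the previous box.
    (2) Appearance summary: per-channel means, with depth as channel 3.
    (3) State estimation: recentre the box on the depth peak (a stand-in
        for the learned state-estimation layer).
    """
    y, x, h, w = prev_box
    pad = h // 2
    y0, x0 = max(y - pad, 0), max(x - pad, 0)
    region = rgbd_frame[y0:y0 + h + 2 * pad, x0:x0 + w + 2 * pad, :]
    features = region.reshape(-1, 4).mean(axis=0)  # crude appearance summary
    depth = region[:, :, 3]
    dy, dx = np.unravel_index(depth.argmax(), depth.shape)
    new_box = (y0 + dy - h // 2, x0 + dx - w // 2, h, w)
    return new_box, features
```

The point of the structure, as in the paper, is that the depth channel gives the state-estimation step a cue that stays stable when RGB appearance is camouflaged.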
Variational Saccading: Efficient Inference for Large Resolution Images
Image classification with deep neural networks is typically restricted to images of small dimensionality, such as 224 x 224 in ResNet models [24]. This limitation excludes the 4000 x 3000 images taken by modern smartphone cameras and smart devices. In this work, we aim to mitigate the prohibitive inferential and memory costs of operating in such large-dimensional spaces. To sample from the high-resolution original input distribution, we propose using a smaller proxy distribution to learn the co-ordinates that correspond to regions of interest in the high-dimensional space. We introduce a new principled variational lower bound that captures the relationship between the proxy distribution's posterior and the original image's co-ordinate space in a way that maximizes the conditional classification likelihood. We empirically demonstrate on one synthetic benchmark and one real-world large-resolution DSLR camera image dataset that our method produces comparable results with ~10x faster inference and lower memory consumption than a model that utilizes the entire original input distribution. Finally, we experiment with a more complex setting using mini-maps from Starcraft II [56] to infer the number of characters in a complex 3D-rendered scene. Even in such complicated scenes our model provides strong localization: a feature missing from traditional classification models.

Comment: Published at BMVC 2019 & NIPS 2018 Bayesian Deep Learning Workshop
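The core mechanic, stripped of the variational machinery, is: build a cheap low-resolution proxy of the image, use it to choose co-ordinates of a region of interest, then read only that crop from the full-resolution input. In the sketch below a simple local-mean saliency heuristic stands in for the learned proxy posterior, so this is an illustration of the data flow under my own assumptions, not the paper's method:

```python
import numpy as np

def saccade_crop(image, proxy_scale=8, crop=32):
    """Pick a high-resolution crop using a low-resolution proxy of the image.

    proxy_scale: downsampling factor for the proxy (block-mean pooling).
    crop:        side length of the full-resolution crop to return.
    Only the proxy and the chosen crop are ever processed, which is where
    the inference/memory savings over full-image processing come from.
    """
    H, W = image.shape
    ph, pw = H // proxy_scale, W // proxy_scale
    # Cheap proxy: block-mean downsampling.
    proxy = image[:ph * proxy_scale, :pw * proxy_scale] \
        .reshape(ph, proxy_scale, pw, proxy_scale).mean(axis=(1, 3))
    # Heuristic 'posterior': attend to the brightest proxy cell.
    iy, ix = np.unravel_index(proxy.argmax(), proxy.shape)
    y = min(max(iy * proxy_scale - crop // 2, 0), H - crop)
    x = min(max(ix * proxy_scale - crop // 2, 0), W - crop)
    return image[y:y + crop, x:x + crop], (y, x)
```

A downstream classifier would then operate only on the returned crop; in the paper the co-ordinate choice is learned jointly with that classifier through the variational lower bound rather than fixed by a heuristic.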