5,194 research outputs found
Weakly Supervised Learning of Objects, Attributes and Their Associations
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-10605-2_31
Weakly Supervised Learning of Objects and Attributes.
PhD
This thesis presents weakly supervised learning approaches to directly
exploit image-level tags (e.g. objects, attributes) for comprehensive
image understanding, including tasks such as object localisation, image
description, image retrieval, semantic segmentation, person re-identification
and person search. Unlike conventional approaches, which tackle the
weakly supervised problem by learning a discriminative model, a generative
Bayesian framework is proposed which provides better mechanisms
to resolve the ambiguity problem. The proposed model differs significantly
from existing approaches in that: (1) All foreground object
classes are modelled jointly in a single generative model that encodes the
co-existence of multiple objects, so that “explaining away” inference can resolve
ambiguity and lead to better learning. (2) Image backgrounds are shared
across classes to better learn varying surroundings and “push out” objects
of interest. (3) The Bayesian formulation enables the exploitation of various
types of prior knowledge to compensate for the limited supervision
offered by weakly labelled data, as well as Bayesian domain adaptation
for transfer learning.
Detecting objects is the first and most critical component in the image
understanding paradigm. Unlike conventional fully supervised object detection
approaches, the proposed model aims to train an object detector
from weakly labelled data. A novel framework based on a Bayesian latent
topic model is proposed to address the problem of localising objects
as bounding boxes in images and videos given only image-level object labels.
The inferred object locations can then be used as annotations to train a
classic object detector with conventional approaches.
However, objects cannot tell the whole story in an image. Beyond detecting
objects, a general visual model should be able to describe objects
and segment them at a pixel level. Another limitation of the initial model is
that it still requires an additional object detector. To remedy the above two
drawbacks, a novel weakly supervised non-parametric Bayesian model is
presented to model objects, attributes and their associations automatically
from weakly labelled images. Once learned, given a new image, the proposed
model can describe the image with the combination of objects and
attributes, as well as their locations and segmentation.
Finally, this thesis further tackles the weakly supervised learning problem
from a transfer learning perspective, by considering the fact that there
are always some fully labelled or weakly labelled data available in a related
domain while only insufficient labelled data exist for training in the
target domain. A powerful semantic description is transferred from the existing
fashion photography datasets to surveillance data to solve the person
re-identification problem.
LOCL: Learning Object-Attribute Composition using Localization
This paper describes LOCL (Learning Object Attribute Composition using
Localization), which generalizes compositional zero-shot learning to objects in
cluttered and more realistic settings. The problem of unseen Object Attribute
(OA) associations has been well studied in the field; however, the performance
of existing methods is limited in challenging scenes. In this context, our key
contribution is a modular approach to localizing objects and attributes of
interest in a weakly supervised context that generalizes robustly to unseen
configurations. Localization coupled with a composition classifier
significantly outperforms state of the art (SOTA) methods, with an improvement
of about 12% on currently available challenging datasets. Further, the
modularity enables the localized feature extractor to be used with
existing OA compositional learning methods to improve their overall
performance.
Comment: 20 pages, 7 figures, 11 tables, Accepted in British Machine Vision
Conference 202
A Discriminative Representation of Convolutional Features for Indoor Scene Recognition
Indoor scene recognition is a multi-faceted and challenging problem due to
the diverse intra-class variations and the confusing inter-class similarities.
This paper presents a novel approach which exploits rich mid-level
convolutional features to categorize indoor scenes. Traditionally used
convolutional features preserve the global spatial structure, which is a
desirable property for general object recognition. However, we argue that this
structuredness is not very helpful when we have large variations in scene
layouts, e.g., in indoor scenes. We propose to transform the structured
convolutional activations to another highly discriminative feature space. The
representation in the transformed space not only incorporates the
discriminative aspects of the target dataset, but it also encodes the features
in terms of the general object categories that are present in indoor scenes. To
this end, we introduce a new large-scale dataset of 1300 object categories
which are commonly present in indoor scenes. Our proposed approach achieves a
significant performance boost over previous state-of-the-art approaches on five
major scene classification datasets.
Areas of Attention for Image Captioning
We propose "Areas of Attention", a novel attention-based model for automatic
image captioning. Our approach models the dependencies between image regions,
caption words, and the state of an RNN language model, using three pairwise
interactions. In contrast to previous attention-based approaches that associate
image regions only to the RNN state, our method allows a direct association
between caption words and image regions. During training these associations are
inferred from image-level captions, akin to weakly-supervised object detector
training. These associations help to improve captioning by localizing the
corresponding regions during testing. We also propose and compare different
ways of generating attention areas: CNN activation grids, object proposals, and
spatial transformer nets applied in a convolutional fashion. Spatial
transformers give the best results. They allow for image specific attention
areas, and can be trained jointly with the rest of the network. Our attention
mechanism and spatial transformer attention areas together yield
state-of-the-art results on the MSCOCO dataset.
Comment: Accepted in ICCV 201
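The "three pairwise interactions" in the abstract above can be illustrated with a minimal sketch. The abstract does not give the scoring function, so everything here is an assumption made for illustration: each interaction is taken to be a bilinear form, the three scores are summed, and a single softmax over all (word, region) pairs yields a joint distribution. The names `bilinear`, `joint_word_region`, `marginals` and `random_matrix`, the matrices `W_wr`, `W_wh`, `W_rh`, and all dimensions are hypothetical.

```python
import math
import random


def bilinear(u, W, v):
    """Return the scalar u^T W v for plain-list vectors u, v and matrix W."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def joint_word_region(words, regions, h, W_wr, W_wh, W_rh):
    """Joint distribution over (word, region) pairs given RNN state h.

    Assumption: each pair is scored by the sum of three bilinear terms
    (word-region, word-state, region-state), and one softmax normalises
    over all pairs.
    """
    scores = [[bilinear(w, W_wr, r) + bilinear(w, W_wh, h) + bilinear(r, W_rh, h)
               for r in regions] for w in words]
    top = max(max(row) for row in scores)  # subtract max for numerical stability
    expd = [[math.exp(s - top) for s in row] for row in scores]
    total = sum(sum(row) for row in expd)
    return [[e / total for e in row] for row in expd]


def marginals(joint):
    """Marginalise the joint into p(word) and p(region)."""
    p_word = [sum(row) for row in joint]
    p_region = [sum(col) for col in zip(*joint)]
    return p_word, p_region


def random_matrix(rng, rows, cols):
    """Small random matrix as a list of rows (illustrative initialisation)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d_w, d_r, d_h = 4, 5, 3
words = random_matrix(rng, 6, d_w)    # 6 hypothetical word embeddings
regions = random_matrix(rng, 3, d_r)  # 3 hypothetical region features
h = [rng.gauss(0.0, 1.0) for _ in range(d_h)]
W_wr = random_matrix(rng, d_w, d_r)
W_wh = random_matrix(rng, d_w, d_h)
W_rh = random_matrix(rng, d_r, d_h)

joint = joint_word_region(words, regions, h, W_wr, W_wh, W_rh)
p_word, p_region = marginals(joint)
```

In this sketch, summing the joint over regions gives a next-word distribution and summing over words gives attention weights over regions. Setting `W_wr` to zero removes the direct word-region term, leaving only the word-state and region-state interactions.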
Learning Multimodal Latent Attributes
Abstract: The rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. Attribute learning has emerged as a promising paradigm for bridging the semantic gap and addressing data sparsity via transferring attribute knowledge in object recognition and relatively simple action classification. In this paper, we address the task of attribute learning for understanding multimedia data with sparse and incomplete labels. In particular, we focus on videos of social group activities, which are particularly challenging and topical examples of this task because of their multi-modal content and complex and unstructured nature relative to the density of annotations. To solve this problem, we (1) introduce a concept of semi-latent attribute space, expressing user-defined and latent attributes in a unified framework, and (2) propose a novel scalable probabilistic topic model for learning multi-modal semi-latent attributes, which dramatically reduces requirements for an exhaustive accurate attribute ontology and expensive annotation effort. We show that our framework is able to exploit latent attributes to outperform contemporary approaches for addressing a variety of realistic multimedia sparse data learning tasks including: multi-task learning, learning with label noise, N-shot transfer learning and, importantly, zero-shot learning.