68 research outputs found
Transductive Multi-View Zero-Shot Learning
Weakly Supervised Learning of Objects, Attributes and Their Associations
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-10605-2_31
Learning Multimodal Latent Attributes
Abstract—The rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. Attribute learning has emerged as a promising paradigm for bridging the semantic gap and addressing data sparsity by transferring attribute knowledge in object recognition and relatively simple action classification. In this paper, we address the task of attribute learning for understanding multimedia data with sparse and incomplete labels. In particular, we focus on videos of social group activities, which are particularly challenging and topical examples of this task because of their multi-modal content and their complex and unstructured nature relative to the density of annotations. To solve this problem, we (1) introduce the concept of a semi-latent attribute space, expressing user-defined and latent attributes in a unified framework, and (2) propose a novel, scalable probabilistic topic model for learning multi-modal semi-latent attributes, which dramatically reduces the requirement for an exhaustive, accurate attribute ontology and expensive annotation effort. We show that our framework is able to exploit latent attributes to outperform contemporary approaches on a variety of realistic multimedia sparse-data learning tasks, including multi-task learning, learning with label noise, N-shot transfer learning and, importantly, zero-shot learning.
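As an illustration of the zero-shot setting mentioned in this abstract, the sketch below shows only the generic final step of attribute-based zero-shot classification, matching predicted attribute vectors against class attribute prototypes. The paper's semi-latent attribute topic model is not reproduced; all names and arrays here are hypothetical.

# Minimal sketch of attribute-based zero-shot classification.
# The semi-latent attribute topic model from the paper is NOT implemented here;
# this only illustrates the generic final step: match predicted attribute
# vectors of unseen-class samples to class attribute prototypes.
import numpy as np

def zero_shot_predict(pred_attributes, class_prototypes):
    """pred_attributes: (n_samples, n_attributes) predicted attribute scores.
    class_prototypes: dict mapping unseen class name -> (n_attributes,) vector."""
    names = list(class_prototypes)
    protos = np.stack([class_prototypes[c] for c in names])  # (n_classes, n_attr)
    # Cosine similarity between each sample and each class prototype.
    a = pred_attributes / np.linalg.norm(pred_attributes, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = a @ p.T                                           # (n_samples, n_classes)
    return [names[i] for i in sims.argmax(axis=1)]

# Hypothetical usage with random numbers, purely for illustration.
rng = np.random.default_rng(0)
preds = rng.random((4, 6))
prototypes = {"birthday_party": rng.random(6), "graduation": rng.random(6)}
print(zero_shot_predict(preds, prototypes))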
Bayesian Joint Modelling for Object Localisation in Weakly Labelled Images
Abstract—We address the problem of localisation of objects as bounding boxes in images and videos with weak labels. This weakly supervised object localisation problem has been tackled in the past using discriminative models where each object class is localised independently from other classes. In this paper, a novel framework based on Bayesian joint topic modelling is proposed, which differs significantly from the existing ones in that: (1) All foreground object classes are modelled jointly in a single generative model that encodes multiple object co-existence so that “explaining away” inference can resolve ambiguity and lead to better learning and localisation. (2) Image backgrounds are shared across classes to better learn varying surroundings and “push out” objects of interest. (3) Our model can be learned with a mixture of weakly labelled and unlabelled data, allowing the large volume of unlabelled images on the Internet to be exploited for learning. Moreover, the Bayesian formulation enables the exploitation of various types of prior knowledge to compensate for the limited supervision offered by weakly labelled data, as well as Bayesian domain adaptation for transfer learning. Extensive experiments on the PASCAL VOC, ImageNet and YouTube-Object videos datasets demonstrate the effectiveness of our Bayesian joint model for weakly supervised object localisation.
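To give a rough feel for the topic-model substrate behind this abstract, the sketch below fits a generic topic model over hypothetical bag-of-visual-words counts from a weakly labelled image collection. The paper's Bayesian joint model (shared background topics, explaining-away inference, priors, domain adaptation) is not reproduced; scikit-learn's LDA is used only as a stand-in.

# Rough sketch: generic topic model over bag-of-visual-words image representations.
# This is NOT the paper's Bayesian joint model; it only shows the generic setup.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_images, vocab_size = 100, 500
# Hypothetical quantised visual-word counts per image.
counts = rng.poisson(1.0, size=(n_images, vocab_size))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-image topic mixtures
# In the paper, per-class foreground topics localised within an image would be
# read off to produce bounding boxes; here we only inspect the mixtures.
print(doc_topics.shape)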
Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation
Abstract. Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation such as visual attributes or semantic word vectors. Such a semantic representation is shared between an annotated auxiliary dataset and a target dataset with no annotation. A projection from a low-level feature space to the semantic space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify an inherent limitation with this approach. That is, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. It is ‘transductive’ in that unlabelled target data points are explored for projection adaptation, and ‘multi-view’ in that both the low-level feature (view) and multiple semantic representations (views) are embedded to rectify the projection shift. We demonstrate through extensive experiments that our framework (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) achieves state-of-the-art recognition results on image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.
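The sketch below is a two-view simplification of the transductive embedding idea described above: unlabelled target-domain data are used to align a low-level feature view with a semantic view in a shared space. The paper aligns several views with a multi-view embedding; scikit-learn's two-view CCA is used here only as an illustrative stand-in, and all dimensions and arrays are hypothetical.

# Illustrative two-view simplification of transductive multi-view embedding.
# The published method embeds multiple views jointly; CCA below handles two.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_target = 200                        # unlabelled target-domain samples
X_low = rng.random((n_target, 128))   # low-level feature view (hypothetical)
X_sem = rng.random((n_target, 85))    # semantic view, e.g. predicted attributes

# Fit the alignment transductively on unlabelled target data, so both views are
# mapped into a shared embedding space where the projection shift is reduced.
cca = CCA(n_components=30)
Z_low, Z_sem = cca.fit_transform(X_low, X_sem)

# Zero-shot recognition can then run in the shared space, e.g. by nearest
# neighbour between embedded samples and embedded class prototypes.
print(Z_low.shape, Z_sem.shape)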
Semantic Regularisation for Recurrent Image Annotation
The "CNN-RNN" design pattern is increasingly widely applied in a variety of
image annotation tasks including multi-label classification and captioning.
Existing models use the weakly semantic CNN hidden layer or its transform as
the image embedding that provides the interface between the CNN and RNN. This
leaves the RNN overstretched with two jobs: predicting the visual concepts and
modelling their correlations for generating structured annotation output.
Importantly this makes the end-to-end training of the CNN and RNN slow and
ineffective due to the difficulty of back propagating gradients through the RNN
to train the CNN. We propose a simple modification to the design pattern that
makes learning more effective and efficient. Specifically, we propose to use a
semantically regularised embedding layer as the interface between the CNN and
RNN. Regularising the interface can partially or completely decouple the
learning problems, allowing each to be more effectively trained and jointly
training much more efficient. Extensive experiments show that state-of-the art
performance is achieved on multi-label classification as well as image
captioning
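A minimal sketch of the interface described in this abstract is given below: the CNN embedding is trained to predict concept probabilities directly (an extra loss on the interface), and the RNN consumes that semantic embedding. This is an assumption-laden illustration in PyTorch, not the authors' architecture; all layer sizes and names are placeholders.

# Minimal sketch of a semantically regularised CNN-RNN interface.
# Sizes, names and the toy input sequence are illustrative, not the paper's setup.
import torch
import torch.nn as nn

class SemanticCNNRNN(nn.Module):
    def __init__(self, feat_dim=2048, num_concepts=1000, hidden=512, vocab=10000):
        super().__init__()
        self.backbone = nn.Linear(feat_dim, feat_dim)          # stand-in for a CNN trunk
        self.concept_head = nn.Linear(feat_dim, num_concepts)  # semantic embedding layer
        self.rnn = nn.LSTM(num_concepts, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab)

    def forward(self, feats, steps=5):
        h = self.backbone(feats)
        concept_logits = self.concept_head(h)        # supervised with concept labels
        sem = torch.sigmoid(concept_logits)          # semantic interface to the RNN
        seq = sem.unsqueeze(1).repeat(1, steps, 1)   # toy input sequence for the RNN
        out, _ = self.rnn(seq)
        return concept_logits, self.decoder(out)     # two losses: semantic + sequence

model = SemanticCNNRNN()
concept_logits, word_logits = model(torch.randn(2, 2048))
print(concept_logits.shape, word_logits.shape)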
- …