Open Vocabulary Scene Parsing
Recognizing arbitrary objects in the wild has been a challenging problem due
to the limitations of existing classification models and datasets. In this
paper, we propose a new task that aims at parsing scenes with a large and open
vocabulary, and we explore several evaluation metrics for this problem. Our
proposed approach is a joint image-pixel and word-concept embedding framework,
in which word concepts are connected by semantic relations. We validate the
open-vocabulary prediction ability of our framework on the ADE20K dataset,
which covers a wide variety of scenes and objects. We further explore the
trained joint embedding space to show its interpretability.
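To make the joint-embedding idea concrete, here is a minimal numpy sketch of open-vocabulary pixel labeling: pixel embeddings are scored against word-concept vectors by cosine similarity and each pixel takes its nearest concept. All names, shapes, and the similarity choice are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of open-vocabulary pixel labeling in a joint
# pixel/word embedding space (illustrative only; names and
# dimensions are hypothetical, not the paper's method).
import numpy as np

def predict_concepts(pixel_feats, concept_embs):
    """pixel_feats: (H*W, d) pixel embeddings from a segmentation net.
    concept_embs: (V, d) word-concept embeddings, one row per concept.
    Returns the index of the nearest concept for every pixel."""
    # Cosine similarity between every pixel and every concept.
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    scores = p @ c.T                      # (H*W, V) similarity matrix
    return scores.argmax(axis=1)          # open-vocabulary label per pixel

# Toy usage: 4 pixels, 3 concepts, 8-dim joint space.
rng = np.random.default_rng(0)
labels = predict_concepts(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
print(labels)
```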
Automatic Discovery, Association Estimation and Learning of Semantic Attributes for a Thousand Categories
Attribute-based recognition models, due to their impressive performance and
their ability to generalize well on novel categories, have been widely adopted
for many computer vision applications. However, usually both the attribute
vocabulary and the class-attribute associations have to be provided manually by
domain experts or a large number of annotators. This is very costly, not
necessarily optimal for recognition performance, and, most importantly, it
limits the applicability of attribute-based models to large-scale datasets. To
tackle this problem, we propose an end-to-end unsupervised attribute learning
approach. We utilize online text corpora to automatically discover a salient
and discriminative vocabulary that correlates well with the human concept of
semantic attributes. Moreover, we propose a deep convolutional model to
optimize class-attribute associations with a linguistic prior that accounts for
noise and missing data in text. In a thorough evaluation on ImageNet, we
demonstrate that our model is able to efficiently discover and learn semantic
attributes at a large scale. Furthermore, we demonstrate that our model
outperforms the state-of-the-art in zero-shot learning on three data sets:
ImageNet, Animals with Attributes, and aPascal/aYahoo. Finally, we enable
attribute-based learning on ImageNet and will share the attributes and
associations for future research.
Comment: Accepted as a conference paper at CVPR 201
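As a rough illustration of how attribute-based models enable zero-shot recognition, the sketch below scores unseen classes by matching predicted attribute probabilities against per-class attribute associations. The classes, attributes, and association matrix are invented for the example and are not the paper's learned associations.

```python
# Hedged sketch of attribute-based zero-shot scoring: an image's
# predicted attribute probabilities are matched against each unseen
# class's attribute-association vector. Purely illustrative; the
# association matrix here is made up, not the paper's learned one.
import numpy as np

attr_probs = np.array([0.9, 0.1, 0.8])          # e.g. furry, striped, four-legged
class_attr = np.array([[1.0, 0.0, 1.0],          # hypothetical "bear"
                       [1.0, 1.0, 1.0],          # hypothetical "tiger"
                       [0.0, 0.0, 0.0]])         # hypothetical "snake"

scores = class_attr @ attr_probs                 # dot-product compatibility
print(scores.argmax())                           # index of best unseen class
```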
Guiding Long-Short Term Memory for Image Caption Generation
In this work we focus on the problem of image caption generation. We propose
an extension of the long short term memory (LSTM) model, which we coin gLSTM
for short. In particular, we add semantic information extracted from the image
as extra input to each unit of the LSTM block, with the aim of guiding the
model towards solutions that are more tightly coupled to the image content.
Additionally, we explore different length-normalization strategies for beam
search to prevent it from favoring short sentences. On benchmark datasets such
as Flickr8K, Flickr30K, and MS COCO, we obtain results that are on par with or
even outperform the current state of the art.
Comment: accepted by ICCV 201
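The guiding idea can be sketched as an LSTM step whose gates receive a fixed semantic guidance vector g alongside the usual input and hidden state. The weight names, shapes, and the single stacked affine map below are assumptions for illustration, not the authors' code.

```python
# Minimal numpy sketch of a "guided" LSTM step in the spirit of gLSTM:
# a fixed semantic guidance vector g (extracted from the image) is fed
# to every gate alongside the usual input and hidden state.
import numpy as np

def glstm_step(x, h, c, g, W, U, G, b):
    """x: input, h/c: previous hidden/cell state, g: semantic guidance.
    W, U, G: stacked gate weights for x, h, and g; b: stacked biases.
    All four gates are computed in one (4*d)-wide affine map."""
    z = W @ x + U @ h + G @ g + b
    d = h.size
    i = 1 / (1 + np.exp(-z[0:d]))          # input gate
    f = 1 / (1 + np.exp(-z[d:2*d]))        # forget gate
    o = 1 / (1 + np.exp(-z[2*d:3*d]))      # output gate
    u = np.tanh(z[3*d:4*d])                # candidate cell update
    c_new = f * c + i * u
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy usage with hidden size 5, input size 3, guidance size 4.
rng = np.random.default_rng(0)
h, c = np.zeros(5), np.zeros(5)
h, c = glstm_step(rng.normal(size=3), h, c, rng.normal(size=4),
                  rng.normal(size=(20, 3)), rng.normal(size=(20, 5)),
                  rng.normal(size=(20, 4)), np.zeros(20))
print(h.shape)
```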
Automatic Concept Discovery from Parallel Text and Visual Corpora
Humans connect language and vision to perceive the world. How can we build a
similar connection for computers? One possible way is via visual concepts,
which are text terms that relate to visually discriminative entities. We
propose an automatic visual concept discovery algorithm using parallel text and
visual corpora; it filters text terms based on the visual discriminative power
of the associated images, and groups them into concepts using visual and
semantic similarities. We illustrate applications of the discovered concepts
on bidirectional image-and-sentence retrieval and image tagging tasks, and
show that the discovered concepts not only significantly outperform several
large sets of manually selected concepts, but also achieve state-of-the-art
performance on the retrieval task.
Comment: To appear in ICCV 201
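A toy version of the two-stage pipeline, assuming a precomputed per-term visual discriminability score and precomputed term embeddings, might look as follows; the thresholds and greedy grouping are stand-ins for the paper's actual criteria.

```python
# Illustrative sketch of the two stages described above: (1) keep only
# text terms whose associated images are visually discriminative, and
# (2) group surviving terms into concepts by semantic similarity.
import numpy as np

def discover_concepts(terms, vis_scores, term_embs,
                      keep_thresh=0.5, sim_thresh=0.8):
    """terms: list of term strings; vis_scores: per-term visual
    discriminability in [0, 1]; term_embs: (len(terms), d) semantic
    embeddings. Greedy grouping of the surviving terms."""
    kept = [i for i, s in enumerate(vis_scores) if s >= keep_thresh]
    e = term_embs / np.linalg.norm(term_embs, axis=1, keepdims=True)
    concepts, used = [], set()
    for i in kept:
        if i in used:
            continue
        group = [j for j in kept
                 if j not in used and float(e[i] @ e[j]) >= sim_thresh]
        used.update(group)
        concepts.append([terms[j] for j in group])
    return concepts

# Toy usage: "the" is filtered out as visually non-discriminative,
# while "dog" and "puppy" merge into one concept.
terms = ["dog", "puppy", "the"]
print(discover_concepts(terms, [0.9, 0.8, 0.1],
                        np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])))
```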
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Because they ignore visual content as a ranking cue, text-based search
techniques for visual retrieval may suffer from inconsistency between the
query words and the visual content. Content-based image retrieval (CBIR),
which uses representations of visual content to identify relevant images,
has attracted sustained attention over the last two decades. The problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate the algorithms
proposed between 2003 and 2016. We conclude with several promising directions
for future research.
Comment: 22 pages
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
Comment: 25 pages
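The common CNN-plus-RNN recipe the survey describes can be caricatured in a few lines: an image feature and a question feature are projected into a shared space, fused multiplicatively, and classified over a fixed answer vocabulary. All weights below are random placeholders; real systems learn them end to end.

```python
# Bare-bones sketch of the standard joint-embedding VQA pipeline:
# project CNN image features and RNN question features into a common
# space, fuse them, and score a fixed set of candidate answers.
import numpy as np

def vqa_forward(img_feat, q_feat, Wi, Wq, Wa):
    joint = np.tanh(Wi @ img_feat) * np.tanh(Wq @ q_feat)  # multiplicative fusion
    logits = Wa @ joint                                    # one score per answer
    return logits.argmax()                                 # predicted answer id

rng = np.random.default_rng(0)
ans = vqa_forward(rng.normal(size=2048),       # e.g. pooled CNN feature
                  rng.normal(size=512),        # e.g. LSTM question encoding
                  rng.normal(size=(256, 2048)),
                  rng.normal(size=(256, 512)),
                  rng.normal(size=(1000, 256)))
print(ans)
```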
COMIC: Towards A Compact Image Captioning Model with Attention
Recent works in image captioning have shown very promising raw performance.
However, most of these encoder-decoder style networks with attention do not
scale naturally to large vocabulary sizes, making them difficult to deploy on
embedded systems with limited hardware resources. This is because the word and
output embedding matrices grow proportionally with the vocabulary size,
adversely affecting the compactness of these networks. To address this
limitation, this paper tackles the compactness of image captioning models, a
problem that has hitherto been unexplored. We show that our proposed model,
named COMIC for COMpact Image Captioning, achieves
comparable results in five common evaluation metrics with state-of-the-art
approaches on both MS-COCO and InstaPIC-1.1M datasets despite having an
embedding vocabulary size that is 39x to 99x smaller. The source code and
models are available at
https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention
Comment: Added source code link and new results in Table
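A back-of-envelope calculation shows why vocabulary size dominates model size: the input word embedding and the output projection both grow linearly with the vocabulary. The dimensions below are hypothetical, not COMIC's actual configuration.

```python
# Illustration of the scaling problem described above: the input and
# output embedding matrices grow linearly with vocabulary size, so
# shrinking the effective vocabulary shrinks the model.
def embedding_params(vocab_size, emb_dim, hidden_dim):
    word_emb = vocab_size * emb_dim        # input word embedding matrix
    out_proj = hidden_dim * vocab_size     # output projection to vocab
    return word_emb + out_proj

full = embedding_params(10_000, 512, 512)   # a typical full vocabulary
tiny = embedding_params(256, 512, 512)      # a ~39x smaller vocabulary
print(full // tiny)                         # parameter reduction factor
```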
Zero-Shot Learning by Convex Combination of Semantic Embeddings
Several recent publications have proposed methods for mapping images into
continuous semantic embedding spaces. In some cases the embedding space is
trained jointly with the image transformation. In other cases the semantic
embedding space is established by an independent natural language processing
task, and then the image transformation into that space is learned in a second
stage. Proponents of these image embedding systems have stressed their
advantages over the traditional n-way classification framing of image
understanding, particularly in terms of the promise for zero-shot learning --
the ability to correctly annotate images of previously unseen object
categories. In this paper, we propose a simple method for constructing an image
embedding system from any existing n-way image classifier and a semantic word
embedding model that contains the n class labels in its vocabulary. Our
method maps images into the semantic embedding space via convex combination of
the class label embedding vectors, and requires no additional training. We show
that this simple and direct method confers many of the advantages associated
with more complex image embedding schemes, and indeed outperforms
state-of-the-art methods on the ImageNet zero-shot learning task.
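The construction the abstract describes can be sketched directly: the classifier's top probabilities weight the corresponding seen-label word vectors, and the nearest unseen-label vector to that convex combination is predicted. The top-T truncation and cosine similarity are details assumed here.

```python
# Sketch of prediction by convex combination of semantic embeddings:
# weight the word vectors of the classifier's top seen classes by
# their (renormalized) probabilities, then pick the nearest unseen label.
import numpy as np

def conse_predict(class_probs, seen_embs, unseen_embs, T=5):
    """class_probs: (n,) softmax over seen classes; seen_embs: (n, d)
    word vectors of seen labels; unseen_embs: (m, d) of unseen labels."""
    top = np.argsort(class_probs)[::-1][:T]
    w = class_probs[top] / class_probs[top].sum()    # convex weights
    f = w @ seen_embs[top]                           # combined embedding
    f = f / np.linalg.norm(f)
    u = unseen_embs / np.linalg.norm(unseen_embs, axis=1, keepdims=True)
    return int((u @ f).argmax())                     # nearest unseen label

# Toy usage: 10 seen classes, 4 unseen labels, 50-dim word vectors.
rng = np.random.default_rng(0)
p = rng.random(10); p /= p.sum()
print(conse_predict(p, rng.normal(size=(10, 50)), rng.normal(size=(4, 50))))
```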
Multi-Label Zero-Shot Learning via Concept Embedding
Zero-Shot Learning (ZSL) enables a learning model to classify instances of
classes unseen during training. While most research in ZSL focuses on
single-label classification, few studies have been done in multi-label ZSL,
where an instance is associated with a set of labels simultaneously, due to the
difficulty in modeling complex semantics conveyed by a set of labels. In this
paper, we propose a novel approach to multi-label ZSL via concept embedding
learned from collections of public users' annotations of multimedia. Thanks to
concept embedding, multi-label ZSL can be done by efficiently mapping an
instance's input features onto the concept embedding space, in a manner
similar to single-label ZSL. Moreover, our semantic learning model is capable of
embedding an out-of-vocabulary label by inferring its meaning from its
co-occurring labels. Thus, our approach allows both seen and unseen labels
during the concept embedding learning to be used in the aforementioned instance
mapping, which makes multi-label ZSL more flexible and suitable for real
applications. Experimental results of multi-label ZSL on images and music
tracks suggest that our approach outperforms a state-of-the-art multi-label
ZSL model and can deal with scenarios involving out-of-vocabulary labels
without re-training the semantic learning model.
Comment: 15 pages. Technical Report 2016-06-01. School of Computer Science,
The University of Manchester. (Submitted to a journal)
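The out-of-vocabulary step can be illustrated with a simple co-occurrence-weighted average: an unseen label inherits the mean embedding of the labels it co-occurs with. This averaging rule is an assumption for illustration, not the paper's exact inference model.

```python
# Illustrative sketch of embedding an out-of-vocabulary label: assign
# it the co-occurrence-weighted average of in-vocabulary embeddings.
import numpy as np

def embed_oov_label(cooc_counts, label_embs):
    """cooc_counts: (n,) co-occurrence counts with n in-vocabulary
    labels; label_embs: (n, d) their concept embeddings."""
    w = cooc_counts / cooc_counts.sum()      # normalize to convex weights
    return w @ label_embs                    # inferred concept vector

embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(embed_oov_label(np.array([4.0, 1.0, 0.0]), embs))  # -> [0.8, 0.2]
```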
Deep Multiple Instance Learning for Zero-shot Image Tagging
In line with the success of deep learning on traditional recognition problems,
several end-to-end deep models for zero-shot recognition have been proposed in
the literature. These models successfully predict a single unseen label for an
input image, but do not scale to cases where multiple unseen objects
are present. In this paper, we model this problem within the framework of
Multiple Instance Learning (MIL). To the best of our knowledge, we propose the
first end-to-end trainable deep MIL framework for the multi-label zero-shot
tagging problem. Due to its novel design, the proposed framework has several
interesting features: (1) Unlike previous deep MIL models, it does not use any
off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation.
(2) During test time, it can process any number of unseen labels given their
semantic embedding vectors. (3) Using only seen labels per image as weak
annotation, it can produce a bounding box for each predicted label. We
experiment with the NUS-WIDE dataset and achieve superior performance across
conventional, zero-shot, and generalized zero-shot tagging tasks.
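A minimal sketch of MIL-style zero-shot tagging, under assumed shapes and a linear scorer: each candidate region in the bag is scored against every unseen label embedding, per-label max-pooling gives image-level tag scores, and the arg-max region indicates which box supports each label.

```python
# Minimal sketch of multiple-instance zero-shot tagging: score every
# image region ("instance" in the bag) against every unseen label's
# semantic embedding, then max-pool over regions per label.
import numpy as np

def mil_tag_scores(region_feats, W, label_embs):
    """region_feats: (R, d) features of R candidate regions; W: (e, d)
    projection into the word-vector space; label_embs: (L, e)."""
    proj = region_feats @ W.T                 # (R, e) regions in word space
    inst_scores = proj @ label_embs.T         # (R, L) region-label scores
    tag_scores = inst_scores.max(axis=0)      # max-pool over the bag
    best_region = inst_scores.argmax(axis=0)  # region supporting each label
    return tag_scores, best_region

# Toy usage: 6 regions, 128-dim features, 3 unseen labels, 50-dim words.
rng = np.random.default_rng(0)
scores, boxes = mil_tag_scores(rng.normal(size=(6, 128)),
                               rng.normal(size=(50, 128)),
                               rng.normal(size=(3, 50)))
print(scores.shape, boxes)                    # (3,) and a region id per label
```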