Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation
Recently, automatic image caption generation has become an important focus of
work on multimodal translation tasks. Existing approaches can be roughly
categorized into two classes, i.e., top-down and bottom-up: the former
transfers the image information (called visual-level features) directly into
a caption, while the latter uses extracted words (called semantic-level
attributes) to generate a description. However, previous methods are
typically based on a one-stage decoder or utilize only part of the visual-level or
semantic-level information for image caption generation. In this paper, we
address this problem and propose an innovative multi-stage architecture (called
Stack-VS) for rich, fine-grained image caption generation, combining
bottom-up and top-down attention models to effectively handle both the visual-level
and semantic-level information of an input image. Specifically, we propose
a well-designed stacked decoder model constituted by a sequence
of decoder cells, each of which contains two LSTM layers that work interactively to
re-optimize attention weights on both visual-level feature vectors and
semantic-level attribute embeddings for generating a fine-grained image caption.
Extensive experiments on the popular benchmark dataset MSCOCO show
significant improvements on different evaluation metrics; the
improvements in BLEU-4/CIDEr/SPICE scores are 0.372, 1.226 and 0.216,
respectively, compared to the state of the art.
Comment: 12 pages, 7 figures
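As a rough illustration of the decoder cell described above, the sketch below (assumed shapes, dimensions and module names; not the authors' implementation) pairs one LSTM layer attending over visual feature vectors with a second attending over semantic-attribute embeddings, the two interacting through their hidden states; stacking several such cells would let later cells re-optimize both sets of attention weights.

```python
# Hedged sketch of a Stack-VS-style decoder cell: all names and dimensions
# are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, att_dim=256):
        super().__init__()
        self.feat = nn.Linear(feat_dim, att_dim)
        self.hid = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, h):                 # feats: (B, N, feat_dim), h: (B, hid_dim)
        e = self.score(torch.tanh(self.feat(feats) + self.hid(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)              # attention weights over the N items
        return (alpha * feats).sum(dim=1)        # weighted context vector

class StackVSCell(nn.Module):
    def __init__(self, word_dim, vis_dim, sem_dim, hid_dim):
        super().__init__()
        self.vis_att = AdditiveAttention(vis_dim, hid_dim)
        self.sem_att = AdditiveAttention(sem_dim, hid_dim)
        self.vis_lstm = nn.LSTMCell(word_dim + vis_dim, hid_dim)
        self.sem_lstm = nn.LSTMCell(hid_dim + sem_dim, hid_dim)

    def forward(self, word, vis_feats, sem_embs, state):
        (h1, c1), (h2, c2) = state
        v_ctx = self.vis_att(vis_feats, h2)      # visual attention guided by the other layer
        h1, c1 = self.vis_lstm(torch.cat([word, v_ctx], dim=1), (h1, c1))
        s_ctx = self.sem_att(sem_embs, h1)       # semantic attention guided by the first layer
        h2, c2 = self.sem_lstm(torch.cat([h1, s_ctx], dim=1), (h2, c2))
        return h2, ((h1, c1), (h2, c2))          # h2 feeds the next stacked cell / word predictor
```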
Image Captioning with Semantic Attention
Automatically generating a natural language description of an image has
attracted interest recently, both because of its importance in practical
applications and because it connects two major artificial intelligence fields:
computer vision and natural language processing. Existing approaches are either
top-down, which start from a gist of an image and convert it into words, or
bottom-up, which come up with words describing various aspects of an image and
then combine them. In this paper, we propose a new algorithm that combines both
approaches through a model of semantic attention. Our algorithm learns to
selectively attend to semantic concept proposals and fuse them into hidden
states and outputs of recurrent neural networks. The selection and fusion form
a feedback connecting the top-down and bottom-up computation. We evaluate our
algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental
results show that our algorithm significantly outperforms the state-of-the-art
approaches consistently across different evaluation metrics.
Comment: 10 pages, 5 figures, CVPR16
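A minimal sketch of the input-side fusion described above, under assumed shapes and names (not the paper's exact formulation): attention weights over detected concept embeddings are computed from the current hidden state, and the attended vector is fused into the word input at each decoding step.

```python
# Hedged illustration of semantic attention over concept proposals;
# shapes and module names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputSemanticAttention(nn.Module):
    def __init__(self, emb_dim, hid_dim):
        super().__init__()
        self.U = nn.Linear(hid_dim, emb_dim, bias=False)

    def forward(self, word_emb, attr_embs, h):
        # attr_embs: (B, K, emb_dim) embeddings of K detected concepts; h: (B, hid_dim)
        scores = torch.bmm(attr_embs, self.U(h).unsqueeze(2)).squeeze(2)   # (B, K)
        alpha = F.softmax(scores, dim=1)
        attended = torch.bmm(alpha.unsqueeze(1), attr_embs).squeeze(1)     # (B, emb_dim)
        return word_emb + attended        # fused input fed to the RNN at this step
```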
Phrase-based Image Captioning with Hierarchical LSTM Model
Automatic generation of captions to describe the content of an image has been
gaining a lot of research interest recently, and most existing works
treat the image caption as purely sequential data. Natural language, however,
possesses a temporal hierarchical structure, with complex dependencies between
subsequences. In this paper, we propose a phrase-based hierarchical Long
Short-Term Memory (phi-LSTM) model to generate image descriptions. In contrast
to conventional solutions that generate captions in a purely sequential
manner, our proposed model decodes image captions from phrases to sentences. It
consists of a phrase decoder at the bottom hierarchy to decode noun phrases of
variable length, and an abbreviated sentence decoder at the upper hierarchy to
decode an abbreviated form of the image description. A complete image caption
is formed by combining the generated phrases with the sentence during the inference
stage. Empirically, our proposed model shows better or competitive results on
the Flickr8k, Flickr30k and MS-COCO datasets in comparison to state-of-the-art
models. We also show that our proposed model is able to generate more novel
captions (not seen in the training data), richer in word content, on
all three datasets.
Comment: 17 pages, 12 figures, ACCV2016 extension, phrase-based image captioning
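The inference-time assembly step can be illustrated with a small, purely hypothetical helper (the placeholder token and phrase format below are assumptions for illustration, not taken from the paper):

```python
# Hypothetical sketch of combining decoded noun phrases with an
# abbreviated sentence containing phrase placeholders.
def assemble_caption(abbrev_tokens, phrases, placeholder="<PHR>"):
    """Replace each placeholder token with the next decoded noun phrase."""
    out, it = [], iter(phrases)
    for tok in abbrev_tokens:
        out.extend(next(it) if tok == placeholder else [tok])
    return " ".join(out)

# Phrases from the bottom-level phrase decoder, abbreviated sentence
# from the upper-level sentence decoder (toy example).
phrases = [["a", "brown", "dog"], ["a", "red", "frisbee"]]
abbrev = ["<PHR>", "is", "catching", "<PHR>"]
print(assemble_caption(abbrev, phrases))   # "a brown dog is catching a red frisbee"
```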
Describing Natural Images Containing Novel Objects with Knowledge Guided Assistance
Images in the wild encapsulate rich knowledge about varied abstract concepts
and cannot be sufficiently described with models built only using image-caption
pairs containing selected objects. We propose to handle such a task with the
guidance of a knowledge base that incorporates many abstract concepts. Our
method is a two-step process: we first build a multi-entity-label image
recognition model to predict abstract concepts as image labels, and then
leverage them in the second step as external semantic attention and
constrained inference in the caption generation model for describing images
that depict unseen/novel objects. Evaluations show that our models outperform
most of the prior work for out-of-domain captioning on MSCOCO and are useful
for the integration of knowledge and vision in general.
Comment: 10 pages, 5 figures
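As a hedged illustration only (the paper's constrained-inference mechanism is not reproduced here), one simple way to let externally predicted concept labels guide decoding is to bias the logits of vocabulary words that match the predicted labels; the vocabulary, labels, and boost value below are hypothetical.

```python
# Toy example: boost the scores of words matching externally predicted
# image labels before choosing the next word. Not the authors' method.
import numpy as np

def label_biased_logits(logits, vocab, predicted_labels, boost=2.0):
    """logits: (V,) word scores; add a bias to words among the predicted labels."""
    biased = logits.copy()
    for i, w in enumerate(vocab):
        if w in predicted_labels:
            biased[i] += boost
    return biased

vocab = ["a", "zebra", "dog", "grazing", "field"]
logits = np.array([0.5, 0.1, 1.2, 0.4, 0.3])
print(label_biased_logits(logits, vocab, {"zebra", "grazing"}))
```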
Let's Transfer Transformations of Shared Semantic Representations
With a good image understanding capability, can we manipulate an image's high-level
semantic representation? Such a transformation operation can be used to
generate or retrieve similar images with a desired modification (for
example, changing a beach background to a street background); similar abilities have
been demonstrated in zero-shot learning, attribute composition and attribute-manipulation
image search. In this work we show how one can learn
transformations with no training examples by learning them on another domain
and then transferring them to the target domain. This is feasible if, first,
transformation training data is more accessible in the other domain and, second,
both domains share similar semantics, so that one can learn the transformations in
a shared embedding space. We demonstrate this on an image retrieval task where
the search query is an image plus an additional transformation specification (for
example: search for images similar to this one, but with a street background
instead of a beach). In one experiment, we transfer transformations from
synthesized 2D blob images to 3D rendered images, and in the other, we transfer
from the text domain to the natural image domain.
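A minimal sketch of the idea, under the simplifying assumption that a transformation can be modeled as a vector offset in the shared embedding space (data and function names are hypothetical): the offset is estimated from before/after pairs in the source domain and applied to a target-domain query embedding for retrieval.

```python
# Hedged sketch: transformation as an embedding offset learned in one
# domain and transferred to another via a shared embedding space.
import numpy as np

def learn_offset(src_before, src_after):
    """Estimate a transformation as the mean embedding difference (source domain)."""
    return (src_after - src_before).mean(axis=0)

def retrieve(query_emb, offset, gallery_embs, k=5):
    """Apply the transferred offset and return indices of the k nearest gallery items."""
    target = query_emb + offset
    d = np.linalg.norm(gallery_embs - target, axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(0)
src_before, src_after = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
gallery = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
print(retrieve(query, learn_offset(src_before, src_after), gallery))
```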
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural
language processing, and have attracted the attention of many
researchers. Typical approaches encode an image into feature
representations and decode them into natural language sentences, but they
neglect high-level semantic concepts and the subtle relationships between image
regions and natural language elements. To make full use of this information,
this paper attempts to exploit text-guided attention and semantic-guided
attention (SA) to find more correlated spatial information and reduce the
semantic gap between vision and language. Our method includes two levels of
attention networks. One is the text-guided attention network, which is used to
select the text-related regions. The other is the SA network, which is used to
highlight the concept-related regions and the region-related concepts. Finally,
all of this information is incorporated to generate captions or answers.
In practice, image captioning and visual question answering experiments have
been carried out, and the experimental results show the excellent
performance of the proposed approach.
Comment: 15 pages, 6 figures, 50 references
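A rough sketch of what a semantic-guided region/concept co-attention step could look like, with assumed shapes and a simplified scoring scheme (not the paper's code): a similarity matrix between region features and concept embeddings yields weights over concept-related regions and region-related concepts.

```python
# Hedged illustration of region <-> concept co-attention; shapes and the
# max-based scoring are assumptions for illustration.
import torch
import torch.nn.functional as F

def co_attention(regions, concepts):
    # regions: (B, R, d) image-region features; concepts: (B, K, d) concept embeddings
    sim = torch.bmm(regions, concepts.transpose(1, 2))           # (B, R, K) similarity
    region_w = F.softmax(sim.max(dim=2).values, dim=1)            # concept-related regions
    concept_w = F.softmax(sim.max(dim=1).values, dim=1)           # region-related concepts
    attended_regions = torch.bmm(region_w.unsqueeze(1), regions).squeeze(1)
    attended_concepts = torch.bmm(concept_w.unsqueeze(1), concepts).squeeze(1)
    return attended_regions, attended_concepts
```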
A Comprehensive Survey of Deep Learning for Image Captioning
Generating a description of an image is called image captioning. Image
captioning requires recognizing the important objects, their attributes and
their relationships in an image. It also needs to generate syntactically and
semantically correct sentences. Deep learning-based techniques are capable of
handling the complexities and challenges of image captioning. In this survey
paper, we aim to present a comprehensive review of existing deep learning-based
image captioning techniques. We discuss the foundations of the techniques to
analyze their performances, strengths and limitations. We also discuss the
datasets and the evaluation metrics popularly used in deep learning-based
automatic image captioning.
Comment: 36 pages, accepted as a journal paper in ACM Computing Surveys (October 2018)
A Weighted Multi-Criteria Decision Making Approach for Image Captioning
Image captioning aims at automatically generating descriptions of an image in
natural language. This is a challenging problem in the field of artificial
intelligence that has recently received significant attention in computer
vision and natural language processing. Among the existing approaches, visual-retrieval-based
methods have proven to be highly effective. These
approaches search for similar images and then build a caption for the query image
based on the captions of the retrieved images. In this study, we present a
method for visual-retrieval-based image captioning, in which we use a multi-criteria
decision-making algorithm to effectively combine several criteria with
proportional impact weights to retrieve the most relevant caption for the query
image. The main idea of the proposed approach is to design a mechanism to
retrieve captions that are more semantically relevant to the query image and then
select the most appropriate caption, imitating human judgment, based on a
weighted multi-criteria decision-making algorithm. Experiments conducted on the MS
COCO benchmark dataset show that the proposed method provides much more
effective results compared to state-of-the-art models by using criteria
with proportional impact weights.
Comment: 12 pages
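The selection step can be sketched as a weighted sum of normalized criterion scores over candidate captions; the criteria, weights, and captions below are hypothetical examples, not the ones used in the paper.

```python
# Hedged sketch of weighted multi-criteria caption selection.
import numpy as np

def select_caption(candidates, criterion_scores, weights):
    """criterion_scores: (n_candidates, n_criteria); weights sum to 1."""
    s = np.asarray(criterion_scores, dtype=float)
    # Min-max normalize each criterion so they are comparable.
    s = (s - s.min(axis=0)) / (s.max(axis=0) - s.min(axis=0) + 1e-9)
    total = s @ np.asarray(weights)          # proportional impact weights
    return candidates[int(np.argmax(total))]

captions = ["a dog runs on grass", "a man rides a horse", "a dog plays with a ball"]
scores = [[0.8, 0.6], [0.3, 0.9], [0.7, 0.7]]   # e.g. image similarity, caption consensus (toy values)
print(select_caption(captions, scores, weights=[0.6, 0.4]))
```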
Scene Graph Generation from Objects, Phrases and Region Captions
Object detection, scene graph generation and region captioning, which are
three scene understanding tasks at different semantic levels, are tied
together: scene graphs are generated on top of objects detected in an image
with their pairwise relationship predicted, while region captioning gives a
language description of the objects, their attributes, relations, and other
context information. In this work, to leverage the mutual connections across
semantic levels, we propose a novel neural network model, termed the Multi-level
Scene Description Network (MSDN), to solve the three vision tasks
jointly in an end-to-end manner. Objects, phrases, and caption regions are
first aligned with a dynamic graph based on their spatial and semantic
connections. Then a feature-refining structure is used to pass messages across
the three levels of semantic tasks through the graph. We benchmark the learned
model on three tasks, and show that joint learning across the three tasks with our
proposed method can bring mutual improvements over previous models.
In particular, on the scene graph generation task, our proposed method
outperforms the state-of-the-art method by a margin of more than 3%.
Comment: accepted by ICCV 2017
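A very simplified sketch of a cross-level message-passing step of the kind described above (assumed shapes, gating scheme, and alignment matrix; not the MSDN implementation): features at one semantic level are refined by gated messages aggregated from an aligned neighboring level.

```python
# Hedged illustration of cross-level feature refinement along alignment edges.
import torch
import torch.nn as nn

class CrossLevelMessage(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, receiver, sender, adj):
        # receiver: (R, d), sender: (S, d), adj: (R, S) row-normalized alignment matrix
        msg = adj @ self.transform(sender)                        # aggregate messages
        g = torch.sigmoid(self.gate(torch.cat([receiver, msg], dim=1)))
        return receiver + g * msg                                 # gated refinement
```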
Multimodal Semantic Attention Network for Video Captioning
Inspired by the fact that different modalities in videos carry complementary
information, we propose a Multimodal Semantic Attention Network (MSAN), a
new encoder-decoder framework incorporating multimodal semantic attributes
for video captioning. In the encoding phase, we detect and generate multimodal
semantic attributes by formulating attribute detection as a multi-label classification problem.
Moreover, we add an auxiliary classification loss to our model to obtain
more effective visual features and high-level multimodal semantic attribute
distributions for sufficient video encoding. In the decoding phase, we extend
each weight matrix of the conventional LSTM to an ensemble of
attribute-dependent weight matrices, and employ an attention mechanism to
attend to different attributes at each time step of the captioning process. We
evaluate our algorithm on two popular public benchmarks, MSVD and MSR-VTT,
achieving results competitive with the current state of the art across six
evaluation metrics.
Comment: 6 pages, 4 figures, accepted by IEEE International Conference on Multimedia and Expo (ICME) 201
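A minimal sketch of an attribute-dependent weight ensemble (an assumed formulation, not the paper's code): an input projection is a mixture of K per-attribute weight matrices, mixed by the current attribute attention weights.

```python
# Hedged illustration: a projection whose weight matrix is a convex mixture
# of per-attribute matrices, mixed by attribute attention.
import torch
import torch.nn as nn

class AttributeMixedProjection(nn.Module):
    def __init__(self, in_dim, out_dim, num_attrs):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_attrs, out_dim, in_dim) * 0.01)

    def forward(self, x, attr_attention):
        # x: (B, in_dim); attr_attention: (B, K) attention over K attributes
        W_mix = torch.einsum("bk,koi->boi", attr_attention, self.W)   # (B, out_dim, in_dim)
        return torch.bmm(W_mix, x.unsqueeze(2)).squeeze(2)            # (B, out_dim)
```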