What does a Car-ssette tape tell?
Captioning has attracted much attention in image and video understanding, while little work has examined audio captioning. This paper contributes a manually annotated dataset of car scenes, extending a previously published hospital audio captioning dataset. An encoder-decoder model with pretrained word embeddings and an additional sentence loss is proposed. The model accelerates training and generates semantically correct sentences that are unique and unseen in the training data. We test the model on the new car dataset, the previous Hospital Dataset, and the Joint Dataset, demonstrating its generalization across different scenes. Further, we make an effort to provide a better objective evaluation metric, namely the BERT similarity score. It compares sentences at the semantic level and compensates for a drawback of n-gram-based metrics such as BLEU, namely the high scores they assign to sentences that merely share words. This new metric demonstrates higher correlation with human evaluation. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many aspects.
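The abstract does not specify how the BERT similarity score is computed; below is a minimal sketch of one plausible formulation, using mean-pooled BERT token embeddings and cosine similarity (the model name and pooling choice are assumptions, not details from the paper):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the paper may use a different BERT variant.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states over non-padding tokens."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def bert_similarity(candidate: str, reference: str) -> float:
    """Semantic-level similarity between a model caption and a reference."""
    a, b = sentence_embedding(candidate), sentence_embedding(reference)
    return torch.nn.functional.cosine_similarity(a, b).item()

# Unlike BLEU, paraphrases with little word overlap can still score highly.
print(bert_similarity("a car engine is idling", "an engine idles in a car"))
```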
Exploring Visual Relationship for Image Captioning
It has long been believed that modeling relationships between objects helps in representing and eventually describing an image. Nevertheless, there has been little evidence supporting this idea for image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory architecture (dubbed GCN-LSTM) that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the objects detected in an image based on their spatial and semantic connections. The representation of each region proposal is then refined by leveraging the graph structure through GCN. With the learnt region-level features, GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
Comment: ECCV 2018
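The abstract does not give the GCN update rule; here is a minimal sketch of a standard graph-convolution step refining detected-region features, in the spirit of the described encoder (the adjacency construction and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class RegionGCNLayer(nn.Module):
    """One graph-convolution step refining region features over an object graph."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, dim) detected-object features
        # adj: (num_regions, num_regions) spatial/semantic connections (0/1)
        adj = adj + torch.eye(adj.size(0))            # add self-loops
        deg = adj.sum(dim=1, keepdim=True)            # row-normalize
        neighborhood = (adj / deg) @ regions          # aggregate neighbor features
        return torch.relu(self.linear(neighborhood))  # refined region features

# Usage: refined features would then feed the attention-based LSTM decoder.
regions = torch.randn(36, 1024)            # e.g., 36 region proposals
adj = (torch.rand(36, 36) > 0.8).float()   # placeholder relationship graph
refined = RegionGCNLayer()(regions, adj)
```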
Pointing Novel Objects in Image Captioning
Image captioning has received significant attention, with remarkable improvements in recent advances. Nevertheless, images in the wild encapsulate rich knowledge and cannot be sufficiently described by models built on image-caption pairs that contain only in-domain objects. In this paper, we propose to address the problem by augmenting standard deep captioning architectures with object learners. Specifically, we present Long Short-Term Memory with Pointing (LSTM-P), a new architecture that facilitates vocabulary expansion and produces novel objects via a pointing mechanism. Technically, object learners are first pre-trained on available object recognition data. At each decoding step, pointing in LSTM-P then balances the probability of generating a word through the LSTM against copying a word from the recognized objects. Furthermore, our captioning encourages global coverage of objects in the sentence. Extensive experiments are conducted on both the held-out COCO image captioning dataset and ImageNet for describing novel objects, and superior results are reported compared to state-of-the-art approaches. More remarkably, we obtain an average F1 score of 60.9% on the held-out COCO dataset.
Comment: CVPR 2019
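The abstract describes the pointing mechanism only at a high level; the following is a minimal sketch of the generate-versus-copy mixture at one decoding step (the gate parameterization and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointingStep(nn.Module):
    """Mix LSTM word generation with copying from recognized objects."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.gen_proj = nn.Linear(hidden, vocab)  # distribution over vocabulary
        self.gate = nn.Linear(hidden, 1)          # p(copy) from the decoder state

    def forward(self, h: torch.Tensor, object_ids: torch.Tensor,
                object_scores: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) LSTM state; object_ids: (batch, k) vocab indices of
        # recognized objects; object_scores: (batch, k) recognizer confidences
        p_gen = F.softmax(self.gen_proj(h), dim=-1)           # (batch, vocab)
        p_copy = torch.zeros_like(p_gen).scatter_add_(
            1, object_ids, F.softmax(object_scores, dim=-1))  # mass on objects
        g = torch.sigmoid(self.gate(h))                       # (batch, 1)
        return (1 - g) * p_gen + g * p_copy                   # final word distribution
```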
Good News, Everyone! Context driven entity-aware captioning for news images
Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in a scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge about the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. To this end, we focus on captioning images used to illustrate news articles. We propose a novel captioning method that leverages the contextual information provided by the text of the news article associated with an image. Our model selectively draws information from the article, guided by visual cues, and dynamically extends the output dictionary to out-of-vocabulary named entities that appear in the context source. Furthermore, we introduce `GoodNews', the largest news image captioning dataset in the literature, and demonstrate state-of-the-art results.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019)
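The paper's mechanism for inserting named entities is not detailed in the abstract; below is a minimal sketch of the general idea, filling entity-type placeholders in a template caption with entities extracted from the article (the placeholder scheme and the spaCy pipeline are assumptions, not the authors' method):

```python
import spacy  # assumes the `en_core_web_sm` model is installed

nlp = spacy.load("en_core_web_sm")

def fill_entities(template: str, article: str) -> str:
    """Replace placeholders like PERSON_ or GPE_ with entities from the article."""
    entities: dict[str, list[str]] = {}
    for ent in nlp(article).ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    words = []
    for token in template.split():
        label = token.rstrip("_")
        if token.endswith("_") and entities.get(label):
            words.append(entities[label].pop(0))  # next unused entity of this type
        else:
            words.append(token)
    return " ".join(words)

article = "Angela Merkel visited Paris on Monday to meet Emmanuel Macron."
print(fill_entities("PERSON_ arrives in GPE_ for talks", article))
# -> "Angela Merkel arrives in Paris for talks"
```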
Recurrent Topic-Transition GAN for Visual Paragraph Generation
A natural image usually conveys rich semantic content and can be viewed from different angles. Existing image description methods are largely restricted by small sets of biased visual paragraph annotations and fail to cover the rich underlying semantics. In this paper, we investigate a semi-supervised paragraph generative framework that synthesizes diverse and semantically coherent paragraph descriptions by reasoning over local semantic regions and exploiting linguistic knowledge. The proposed Recurrent Topic-Transition Generative Adversarial Network (RTT-GAN) builds an adversarial framework between a structured paragraph generator and multi-level paragraph discriminators. The paragraph generator produces sentences recurrently, incorporating region-based visual and language attention mechanisms at each step. The quality of the generated sentences is assessed by multi-level adversarial discriminators from two aspects: plausibility at the sentence level and topic-transition coherence at the paragraph level. Joint adversarial training of RTT-GAN drives the model to generate realistic paragraphs with smooth logical transitions between sentence topics. Extensive quantitative experiments on image and video paragraph datasets demonstrate the effectiveness of RTT-GAN in both supervised and semi-supervised settings. Qualitative results on telling diverse stories for an image also verify the interpretability of RTT-GAN.
Comment: 10 pages, 6 figures
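The abstract names, but does not formalize, the two discriminator levels; here is a minimal sketch of how a sentence-plausibility critic and a topic-transition critic could jointly score a generated paragraph (the architectures and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class ParagraphDiscriminators(nn.Module):
    """Score a paragraph at sentence level and at topic-transition level."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.sentence_critic = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.transition_rnn = nn.GRU(dim, dim, batch_first=True)
        self.paragraph_critic = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, sentence_embs: torch.Tensor):
        # sentence_embs: (batch, num_sentences, dim), one embedding per sentence
        plausibility = self.sentence_critic(sentence_embs).squeeze(-1)  # per sentence
        _, last = self.transition_rnn(sentence_embs)        # track topic flow
        coherence = self.paragraph_critic(last.squeeze(0))  # per paragraph
        return plausibility, coherence

# In adversarial training, the generator pushes both scores toward "real".
d = ParagraphDiscriminators()
plaus, coh = d(torch.randn(2, 5, 512))  # 2 paragraphs of 5 sentences each
```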
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural language processing, and have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper exploits text-guided attention and semantic-guided attention (SA) to find the most correlated spatial information and reduce the semantic gap between vision and language. Our method comprises two attention networks. One is the text-guided attention network, which selects text-related regions. The other is the SA network, which highlights concept-related regions and region-related concepts. Finally, all of this information is combined to generate captions or answers. In practice, image captioning and visual question answering experiments have been carried out, and the results show the excellent performance of the proposed approach.
Comment: 15 pages, 6 figures, 50 references
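The abstract leaves both attention networks unspecified; the sketch below shows a generic guided-attention step that either network could instantiate, with a query (text embedding or semantic concepts) attending over region features (the scoring function and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def guided_attention(query: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
    """Weight region features by their relevance to a guiding query.

    query:   (batch, dim)     text embedding or semantic-concept embedding
    regions: (batch, n, dim)  spatial region features from the CNN encoder
    returns: (batch, dim)     attended visual context
    """
    scores = torch.bmm(regions, query.unsqueeze(-1)).squeeze(-1)  # (batch, n)
    weights = F.softmax(scores / regions.size(-1) ** 0.5, dim=-1)
    return torch.bmm(weights.unsqueeze(1), regions).squeeze(1)

# Text-guided: query from the partial caption; semantic-guided: query from
# predicted attribute/concept embeddings.
context = guided_attention(torch.randn(4, 512), torch.randn(4, 49, 512))
```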
CNN+CNN: Convolutional Decoders for Image Captioning
Image captioning is a challenging task that combines computer vision and natural language processing. A variety of approaches have been proposed to automatically describe an image, and recurrent neural network (RNN) or long short-term memory (LSTM) based models dominate the field. However, RNNs and LSTMs cannot be computed in parallel and ignore the underlying hierarchical structure of a sentence. In this paper, we propose a framework that employs only convolutional neural networks (CNNs) to generate captions. Owing to parallel computation, our basic model trains around 3 times faster than NIC (an LSTM-based model) while also giving better results. We conduct extensive experiments on MSCOCO and investigate the influence of model width and depth. Compared with LSTM-based models that apply similar attention mechanisms, our proposed model achieves comparable BLEU-1,2,3,4 and METEOR scores and higher CIDEr scores. We also test our model on a paragraph annotation dataset and obtain a higher CIDEr score than a hierarchical LSTM.
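The abstract does not show the decoder itself; here is a minimal sketch of the key ingredient of a convolutional caption decoder, a causal (left-padded) 1-D convolution so each word sees only previous words, which is what allows training in parallel (the kernel size and dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvLayer(nn.Module):
    """1-D convolution over word embeddings with no access to future words."""
    def __init__(self, dim: int = 512, kernel: int = 3):
        super().__init__()
        self.kernel = kernel
        self.conv = nn.Conv1d(dim, dim, kernel)

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (batch, dim, seq_len); pad only on the left so position t
        # depends on positions <= t. Unlike an RNN, this runs in parallel over t.
        padded = F.pad(words, (self.kernel - 1, 0))
        return torch.relu(self.conv(padded))

# Stacking such layers grows the receptive field hierarchically.
out = CausalConvLayer()(torch.randn(8, 512, 20))  # (8, 512, 20)
```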
A sequential guiding network with attention for image captioning
Recent advances in deep learning in both computer vision (CV) and natural language processing (NLP) provide a new way of understanding semantics, allowing us to tackle more challenging tasks such as automatic description generation from natural images. For this task, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as the image encoder and a recurrent neural network (RNN) as the decoder. In this paper, we introduce a sequential guiding network that guides the decoder during word generation. The new model extends the attention-based encoder-decoder framework with an additional guiding long short-term memory (LSTM) and can be trained end-to-end on image/description pairs. We validate our approach through extensive experiments on a benchmark dataset, MS COCO Captions. The proposed model achieves significant improvements over other state-of-the-art deep learning models.
Comment: 5 pages, 2 figures, 1 table, IEEE ICASSP 201
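The abstract only names the guiding LSTM; below is a minimal sketch of how a second LSTM could produce a guidance vector consumed by the word-generating LSTM at each step (the wiring between the two cells is an assumption):

```python
import torch
import torch.nn as nn

class GuidedDecoderStep(nn.Module):
    """Decoder step where a guiding LSTM conditions the word-generating LSTM."""
    def __init__(self, dim: int = 512, vocab: int = 10000):
        super().__init__()
        self.guide = nn.LSTMCell(dim, dim)        # produces the guidance vector
        self.decoder = nn.LSTMCell(2 * dim, dim)  # sees word + guidance
        self.out = nn.Linear(dim, vocab)

    def forward(self, word_emb, visual_ctx, guide_state, dec_state):
        # word_emb, visual_ctx: (batch, dim); states are (h, c) tuples
        gh, gc = self.guide(visual_ctx, guide_state)            # update guidance
        dh, dc = self.decoder(torch.cat([word_emb, gh], -1), dec_state)
        return self.out(dh), (gh, gc), (dh, dc)                 # logits + new states
```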
Object Hallucination in Image Captioning
Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in the scene. One problem is that standard metrics measure only similarity to ground-truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models against veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or to language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always hallucinate less, and that models which hallucinate more tend to make errors driven by language priors.
Comment: Rohrbach and Hendricks contributed equally; accepted to EMNLP 2018
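The abstract does not name the metric; here is a minimal sketch of an object-hallucination rate in its spirit, counting mentioned objects that are absent from the image's ground-truth object set (the tiny object vocabulary and word-level matching are simplified assumptions):

```python
OBJECT_VOCAB = {"dog", "cat", "frisbee", "car", "person"}  # assumed, tiny

def hallucination_rate(caption: str, gt_objects: set[str]) -> float:
    """Fraction of mentioned objects that are not actually in the image."""
    mentioned = {w for w in caption.lower().split() if w in OBJECT_VOCAB}
    if not mentioned:
        return 0.0
    hallucinated = mentioned - gt_objects
    return len(hallucinated) / len(mentioned)

# "dog" is mentioned but absent from the image: rate = 1/2.
print(hallucination_rate("a dog chases a frisbee", {"frisbee", "person"}))
```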
A Comprehensive Survey of Deep Learning for Image Captioning
Generating a description of an image is called image captioning. Image captioning requires recognizing the important objects, their attributes, and their relationships in an image. It also needs to generate syntactically and semantically correct sentences. Deep learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey paper, we aim to present a comprehensive review of existing deep learning-based image captioning techniques. We discuss the foundations of the techniques to analyze their performances, strengths, and limitations. We also discuss the datasets and the evaluation metrics popularly used in deep learning-based automatic image captioning.
Comment: 36 Pages, Accepted as a Journal Paper in ACM Computing Surveys (October 2018)