729 research outputs found
3G structure for image caption generation
Making a machine automatically describe the content of an image with a natural language sentence is a major challenge in computer vision. Previous works have made great progress on this task, but they use only the global or the local image feature, which may lose important subtle or global information about an image. In this paper, we propose a three-gated model that fuses the global and local image features together for the task of image caption generation. The model has three gated structures. 1) A gate for the global image feature, which adaptively decides when and how much of the global image feature should be fed into the sentence generator. 2) A gated recurrent neural network (RNN) used as the sentence generator. 3) A gated feedback method for stacking RNNs, employed to increase the capacity for nonlinear fitting. More specifically, the global and local image features are combined in this paper, which makes full use of the image information. The global image feature is controlled by the first gate, and the local image feature is selected by the attention mechanism. With the latter two gates, the relationship between image and text can be explored well, which improves the performance of the language part as well as the multi-modal embedding part. Experimental results show that our proposed method outperforms the state of the art for image caption generation.
Comment: 35 pages, 7 figures, magazine
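The global-feature gate in 1) can be pictured as a sigmoid mask computed from the current decoder state and the global CNN feature. The following is a minimal, hypothetical PyTorch sketch of that idea only; the module name, dimensions, and the way the gated feature is injected are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a "global feature gate": a learned sigmoid gate decides
# how much of the global image feature enters the sentence generator at each step.
import torch
import torch.nn as nn

class GlobalFeatureGate(nn.Module):
    def __init__(self, hidden_dim: int, feat_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + feat_dim, feat_dim)
        self.proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) current decoder state
        # global_feat: (batch, feat_dim) global CNN image feature
        g = torch.sigmoid(self.gate(torch.cat([hidden, global_feat], dim=-1)))
        return self.proj(g * global_feat)  # gated contribution for the decoder input

gate = GlobalFeatureGate(hidden_dim=512, feat_dim=2048)
h = torch.zeros(4, 512)
v = torch.randn(4, 2048)
print(gate(h, v).shape)  # torch.Size([4, 512])
```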
Exploring Models and Data for Remote Sensing Image Caption Generation
Inspired by the recent development of artificial satellites, remote sensing images have attracted extensive attention. Recently, noticeable progress has been made in scene classification and target detection. However, it is still not clear how to describe the content of a remote sensing image with accurate and concise sentences. In this paper, we investigate how to describe remote sensing images with accurate and flexible sentences. First, some annotation instructions are presented to better describe remote sensing images, considering their special characteristics. Second, in order to exhaustively exploit the contents of remote sensing images, a large-scale aerial image data set is constructed for remote sensing image captioning. Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing captioning. Extensive experiments on the proposed data set demonstrate that the content of a remote sensing image can be completely described by the generated language descriptions. The data set is available at https://github.com/201528014227051/RSICD_optimal
Comment: 14 pages, 8 figures
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural language processing, and have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper attempts to exploit text-guided attention and semantic-guided attention (SA) to find more correlated spatial information and reduce the semantic gap between vision and language. Our method includes two levels of attention networks. One is the text-guided attention network, which is used to select the text-related regions. The other is the SA network, which is used to highlight the concept-related regions and the region-related concepts. Finally, all of this information is incorporated to generate captions or answers. Image captioning and visual question answering experiments have been carried out, and the experimental results show the excellent performance of the proposed approach.
Comment: 15 pages, 6 figures, 50 references
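As a rough illustration of the two attention levels, the sketch below applies text-guided attention over region features and then a second attention pass over detected concept embeddings; the dot-product scoring, shapes, and the way the two contexts are fused are assumptions for illustration, not the paper's exact formulation.

```python
# Toy two-level attention: text-guided attention over image regions, then
# semantic attention over concept/attribute embeddings. Shapes are illustrative.
import torch
import torch.nn.functional as F

def attend(query, keys):
    # query: (batch, d), keys: (batch, n, d) -> weighted sum over the n items
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)   # (batch, n)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1), weights

batch, n_regions, n_concepts, d = 2, 49, 10, 256
text_state = torch.randn(batch, d)             # decoder / question state
regions = torch.randn(batch, n_regions, d)     # region features (e.g. CNN grid)
concepts = torch.randn(batch, n_concepts, d)   # embeddings of detected concepts

region_ctx, _ = attend(text_state, regions)    # text-guided attention
concept_ctx, _ = attend(region_ctx, concepts)  # semantic-guided attention
fused = torch.cat([text_state, region_ctx, concept_ctx], dim=-1)
print(fused.shape)  # torch.Size([2, 768])
```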
Image Captioning based on Deep Reinforcement Learning
Recently, it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems on natural language processing tasks. Moreover, given the complexity of understanding image content and the diverse ways of describing it in natural language, image captioning remains a challenging problem. To the best of our knowledge, most state-of-the-art methods follow a sequential modeling pattern, such as recurrent neural networks (RNNs). In this paper, we propose a novel architecture for image captioning that uses deep reinforcement learning to optimize the captioning task. We utilize two networks, called the "policy network" and the "value network", to collaboratively generate the captions of images. Experiments are conducted on the Microsoft COCO dataset, and the experimental results verify the effectiveness of the proposed method.
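A generic way to realize such a policy/value pair for caption decoding is sketched below: the policy network proposes next-word distributions while the value network scores how promising the partial caption is. All architectural details (GRU cell, dimensions, how the value is consumed during search) are assumptions, not the paper's actual design.

```python
# Hypothetical policy network (next-word distribution) and value network
# (estimate of the eventual caption reward) used together at decode time.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)   # image feature -> initial state
        self.rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_word, h):
        h = self.rnn(self.embed(prev_word), h)
        return torch.log_softmax(self.out(h), dim=-1), h

class ValueNet(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, img_feat, h):
        return self.score(torch.cat([img_feat, h], dim=-1))  # predicted future reward

policy, value = PolicyNet(), ValueNet()
img = torch.randn(1, 2048)
h = policy.init_h(img)
logp, h = policy(torch.tensor([1]), h)   # one decoding step from a <start> token id
print(logp.shape, value(img, h).shape)   # value gives a lookahead score for the beam
```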
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
We present a new large-scale multilingual video description dataset, VATEX,
which contains over 41,250 videos and 825,000 captions in both English and
Chinese. Among the captions, there are over 206,000 English-Chinese parallel
translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is
multilingual, larger, linguistically complex, and more diverse in terms of both
video and natural language descriptions. We also introduce two tasks for
video-and-language research based on VATEX: (1) Multilingual Video Captioning,
aimed at describing a video in various languages with a compact unified
captioning model, and (2) Video-guided Machine Translation, to translate a
source language description into the target language using the video
information as additional spatiotemporal context. Extensive experiments on the
VATEX dataset show that, first, the unified multilingual model can not only
produce both English and Chinese descriptions for a video more efficiently, but
also offer improved performance over the monolingual models. Furthermore, we
demonstrate that the spatiotemporal video context can be effectively utilized
to align source and target languages and thus assist machine translation. Finally, we discuss the potential of using VATEX for other video-and-language research.
Comment: ICCV 2019 Oral. 17 pages, 14 figures, 6 tables (updated the VATEX website link: vatex-challenge.org)
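One common way to build a compact unified bilingual captioner, as called for by task (1), is a single decoder conditioned on a language embedding. The sketch below illustrates that general idea only and is not the VATEX baseline model; the names, dimensions, and language-id convention are assumed.

```python
# Generic sketch: one decoder shared across languages, selected by a language embedding.
import torch
import torch.nn as nn

class UnifiedCaptioner(nn.Module):
    def __init__(self, vocab=20000, n_langs=2, video_dim=1024, hidden=512):
        super().__init__()
        self.word = nn.Embedding(vocab, hidden)
        self.lang = nn.Embedding(n_langs, hidden)   # e.g. 0 = English, 1 = Chinese (assumed ids)
        self.init_h = nn.Linear(video_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, video_feat, tokens, lang_id):
        h0 = torch.tanh(self.init_h(video_feat)).unsqueeze(0)    # (1, batch, hidden)
        x = self.word(tokens) + self.lang(lang_id).unsqueeze(1)  # broadcast language embedding
        y, _ = self.rnn(x, h0)
        return self.out(y)                                       # next-word logits per step

model = UnifiedCaptioner()
logits = model(torch.randn(2, 1024), torch.randint(0, 20000, (2, 12)), torch.tensor([0, 1]))
print(logits.shape)  # torch.Size([2, 12, 20000])
```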
Image Inspired Poetry Generation in XiaoIce
Vision is a common source of inspiration for poetry. The objects and the
sentimental imprints that one perceives from an image may lead to various
feelings depending on the reader. In this paper, we present a system that generates poetry from images to mimic this process. Given an image, we first extract a
few keywords representing objects and sentiments perceived from the image.
These keywords are then expanded to related ones based on their associations in
human written poems. Finally, verses are generated gradually from the keywords
using recurrent neural networks trained on existing poems. Our approach is
evaluated by human assessors and compared to other generation baselines. The
results show that our method can generate poems that are more artistic than the
baseline methods. This is one of the few attempts to generate poetry from
images. By deploying our proposed approach, XiaoIce has already generated more
than 12 million poems for users since its release in July 2017. A book of its
poems has been published by Cheers Publishing, which claims that this is the first-ever poetry collection written by an AI in human history.
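The three-stage pipeline described above (keyword extraction, association-based expansion, verse-by-verse generation) can be outlined as follows. Every helper here is a toy placeholder standing in for the real recognizers and the recurrent verse generator, so the sketch only shows the control flow, not XiaoIce's actual components.

```python
# Skeleton of the described image-to-poem pipeline; helpers are placeholder stubs.
from typing import List

def extract_keywords(image_path: str) -> List[str]:
    # The real system runs object and sentiment recognizers on the image.
    return ["city", "twilight"]

def expand_keywords(keywords: List[str]) -> List[str]:
    # Associations mined from human-written poems; here a toy lookup table.
    associations = {"city": ["streetlight"], "twilight": ["silence"]}
    return keywords + [w for k in keywords for w in associations.get(k, [])]

def generate_verse(keyword: str, context: List[str]) -> str:
    # Stand-in for the recurrent-network verse generator trained on existing poems.
    return f"the {keyword} leans into {' and '.join(context) if context else 'the dark'}"

def image_to_poem(image_path: str) -> List[str]:
    verses, context = [], []
    for kw in expand_keywords(extract_keywords(image_path)):
        verses.append(generate_verse(kw, context))  # each verse conditioned on earlier keywords
        context.append(kw)
    return verses

print("\n".join(image_to_poem("photo.jpg")))
```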
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models have adopted an
encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words based on the visual features
with the attention mechanism. Despite the success of existing studies, current
methods only model the co-attention that characterizes the inter-modal
interactions while neglecting the self-attention that characterizes the
intra-modal interactions. Inspired by the success of the Transformer model in
machine translation, here we extend it to a Multimodal Transformer (MT) model
for image captioning. Compared to existing image captioning approaches, the MT
model simultaneously captures intra- and inter-modal interactions in a unified
attention block. Due to the in-depth modular composition of such attention
blocks, the MT model can perform complex multimodal reasoning and output
accurate captions. Moreover, to further improve the image captioning
performance, multi-view visual features are seamlessly introduced into the MT
model. We quantitatively and qualitatively evaluate our approach using the
benchmark MSCOCO image captioning dataset and conduct extensive ablation
studies to investigate the reasons behind its effectiveness. The experimental
results show that our method significantly outperforms the previous
state-of-the-art methods. With an ensemble of seven models, our solution ranks first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
Comment: submitted to a journal
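The unified attention block can be approximated with standard multi-head attention layers: self-attention within each modality plus cross-attention from words to image regions. The PyTorch sketch below shows that general pattern; the layer layout, normalization placement, and sizes are assumptions rather than the paper's exact block.

```python
# Sketch of a block combining intra-modal self-attention and inter-modal co-attention.
import torch
import torch.nn as nn

class MultimodalBlock(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_txt = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_img = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, txt, img):
        txt = self.norm1(txt + self.self_txt(txt, txt, txt)[0])  # intra-modal (text)
        img = self.norm2(img + self.self_img(img, img, img)[0])  # intra-modal (image)
        txt = self.norm3(txt + self.cross(txt, img, img)[0])     # inter-modal (text attends to regions)
        return txt, img

block = MultimodalBlock()
words, regions = torch.randn(2, 15, 512), torch.randn(2, 36, 512)
out_txt, out_img = block(words, regions)
print(out_txt.shape, out_img.shape)  # (2, 15, 512) (2, 36, 512)
```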
Unsupervised Image Captioning
Deep neural networks have achieved great successes on the image captioning
task. However, most of the existing models depend heavily on paired
image-sentence datasets, which are very expensive to acquire. In this paper, we
make the first attempt to train an image captioning model in an unsupervised
manner. Instead of relying on manually labeled image-sentence pairs, our
proposed model merely requires an image set, a sentence corpus, and an existing
visual concept detector. The sentence corpus is used to teach the captioning
model how to generate plausible sentences. Meanwhile, the knowledge in the
visual concept detector is distilled into the captioning model to guide the
model to recognize the visual concepts in an image. In order to further
encourage the generated captions to be semantically consistent with the image,
the image and caption are projected into a common latent space so that they can
reconstruct each other. Given that existing sentence corpora are mainly designed for linguistic research and thus bear little relation to image content, we crawl a large-scale image description corpus of two million natural sentences to facilitate the unsupervised image captioning scenario. Experimental results show that our proposed model is able to produce quite promising results without any caption annotations.
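The shared latent space idea can be illustrated with a toy cross-reconstruction objective: project image features and sentence features into a common space and require each to reconstruct the other. The snippet below is only a schematic of that constraint; the encoders, dimensions, and MSE losses are placeholders, not the paper's actual objective.

```python
# Toy cross-reconstruction between image and sentence features via a common latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_img, d_sent, d_latent = 2048, 512, 256
img_to_z = nn.Linear(d_img, d_latent)
sent_to_z = nn.Linear(d_sent, d_latent)
z_to_img = nn.Linear(d_latent, d_img)
z_to_sent = nn.Linear(d_latent, d_sent)

img_feat = torch.randn(8, d_img)    # e.g. from a pretrained CNN
sent_feat = torch.randn(8, d_sent)  # e.g. from the caption generator's last state

# Cross reconstruction: image -> latent -> sentence feature, and the reverse.
loss = (F.mse_loss(z_to_sent(img_to_z(img_feat)), sent_feat)
        + F.mse_loss(z_to_img(sent_to_z(sent_feat)), img_feat))
loss.backward()
print(float(loss))
```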
Natural Language Generation in Dialogue using Lexicalized and Delexicalized Data
Natural language generation plays a critical role in spoken dialogue systems.
We present a new approach to natural language generation for task-oriented
dialogue using recurrent neural networks in an encoder-decoder framework. In
contrast to previous work, our model uses both lexicalized and delexicalized
components, i.e., slot-value pairs for dialogue acts, with slots and their corresponding values aligned. This allows our model to learn from all
available data including the slot-value pairing, rather than being restricted
to delexicalized slots. We show that this helps our model generate more natural
sentences with better grammar. We further improve our model's performance by
transferring weights learnt from a pretrained sentence auto-encoder. Human
evaluation of our best-performing model indicates that it generates sentences
which users find more appealing.
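One way to expose both the delexicalized slot and its lexical value to a seq2seq generator is to interleave slot tokens with the value's surface words, keeping each value aligned with its slot. The small sketch below illustrates such an input encoding; the token scheme and dialogue-act format are assumptions for illustration, not the paper's representation.

```python
# Illustrative encoding of a dialogue act with aligned slot-value pairs.
from typing import Dict, List

def act_to_tokens(act_type: str, slots: Dict[str, str]) -> List[str]:
    tokens = [f"<act:{act_type}>"]
    for slot, value in slots.items():
        tokens.append(f"<slot:{slot}>")   # delexicalized component
        tokens.extend(value.split())      # lexicalized component, aligned with its slot
    return tokens

encoder_input = act_to_tokens("inform", {"food": "north indian", "area": "city centre"})
print(encoder_input)
# ['<act:inform>', '<slot:food>', 'north', 'indian', '<slot:area>', 'city', 'centre']
```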
Multimodal Semantic Attention Network for Video Captioning
Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating attribute detection as a multi-label classification problem. Moreover, we add an auxiliary classification loss to our model in order to obtain more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ an attention mechanism to attend to different attributes at each time step of the captioning process. We evaluate the algorithm on two popular public benchmarks, MSVD and MSR-VTT, achieving results competitive with the current state of the art across six evaluation metrics.
Comment: 6 pages, 4 figures, accepted by IEEE International Conference on Multimedia and Expo (ICME) 2019
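The ensemble of attribute-dependent weight matrices can be pictured as a per-step mixture: attention weights over the detected attributes combine one weight matrix per attribute into an effective matrix. The PyTorch sketch below shows that mixing for a single linear map rather than a full LSTM; the shapes and scoring rule are illustrative assumptions.

```python
# Per-example mixture of attribute-dependent weight matrices, guided by attribute attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeMixedLinear(nn.Module):
    def __init__(self, n_attrs=20, d_in=512, d_out=512):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_attrs, d_out, d_in) * 0.02)  # one matrix per attribute
        self.attn = nn.Linear(d_in, n_attrs)

    def forward(self, x, attr_probs):
        # x: (batch, d_in) decoder input; attr_probs: (batch, n_attrs) from the multi-label classifier
        scores = self.attn(x) + torch.log(attr_probs + 1e-8)   # attend more to likely attributes
        alpha = F.softmax(scores, dim=-1)                      # (batch, n_attrs)
        W = torch.einsum('ba,aoi->boi', alpha, self.weights)   # per-example mixed matrix
        return torch.einsum('boi,bi->bo', W, x)

layer = AttributeMixedLinear()
x = torch.randn(4, 512)
attr_probs = torch.sigmoid(torch.randn(4, 20))
print(layer(x, attr_probs).shape)  # torch.Size([4, 512])
```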