Human Attention in Image Captioning: Dataset and Analysis
In this work, we present a novel dataset consisting of eye movements and
verbal descriptions recorded synchronously over images. Using this data, we
study the differences in human attention during free-viewing and image
captioning tasks. We look into the relationship between human attention and
language constructs during perception and sentence articulation. We also
analyse attention deployment mechanisms in the top-down soft attention approach
that is argued to mimic human attention in captioning tasks, and investigate
whether visual saliency can help image captioning. Our study reveals that (1)
human attention behaviour differs in free-viewing and image description tasks.
Humans tend to fixate on a greater variety of regions under the latter task,
(2) there is a strong relationship between described objects and attended
objects (97% of the described objects are attended), (3) a convolutional
neural network used as the feature encoder accounts for human-attended
regions during image captioning to a great extent (around 78%), (4) the
soft-attention mechanism differs from human attention, both spatially and
temporally, and there is low correlation between caption scores and attention
consistency scores. These results indicate a large gap between humans and
machines with regard to top-down attention, and (5) by integrating the soft
attention model
with image saliency, we can significantly improve the model's performance on
Flickr30k and MSCOCO benchmarks. The dataset can be found at:
https://github.com/SenHe/Human-Attention-in-Image-Captioning
Comment: To appear at ICCV 2019
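To make the top-down soft attention mechanism analysed above concrete, here is a minimal PyTorch sketch of a soft attention module whose weights are biased by a precomputed saliency map. The class name, the multiplicative saliency combination, and all dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyModulatedSoftAttention(nn.Module):
    """Soft attention over CNN feature-map regions, biased by a
    precomputed saliency map (hypothetical variant of the paper's idea)."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)   # project region features
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)              # scalar attention score

    def forward(self, feats, hidden, saliency):
        # feats:    (B, R, feat_dim)  R region features from the CNN encoder
        # hidden:   (B, hidden_dim)   current decoder (e.g. LSTM) state
        # saliency: (B, R)            per-region saliency from a saliency model
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (B, R) raw scores
        alpha = F.softmax(e, dim=1)                      # top-down soft attention
        alpha = alpha * saliency                         # bias toward salient regions
        alpha = alpha / (alpha.sum(dim=1, keepdim=True) + 1e-8)  # renormalize
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim) context
        return context, alpha
```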
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
In this paper, we propose an Attentional Generative Adversarial Network
(AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained
text-to-image generation. With a novel attentional generative network, the
AttnGAN can synthesize fine-grained details at different subregions of the
image by paying attention to the relevant words in the natural language
description. In addition, a deep attentional multimodal similarity model is
proposed to compute a fine-grained image-text matching loss for training the
generator. The proposed AttnGAN significantly outperforms the previous state of
the art, boosting the best reported inception score by 14.14% on the CUB
dataset and 170.25% on the more challenging COCO dataset. A detailed analysis
is also performed by visualizing the attention layers of the AttnGAN. For the
first time, it shows that the layered attentional GAN is able to automatically
select the condition at the word level for generating different parts of the
image.
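As a rough illustration of the word-level attention described above, the sketch below lets each image subregion attend over the words of the description and returns a per-region word-context vector. The function name, shapes, and plain dot-product scoring are assumptions for illustration; AttnGAN's actual attention projects word features into the image feature space with a learned layer first.

```python
import torch
import torch.nn.functional as F

def word_level_attention(region_feats, word_embs):
    """Word-level attention in the spirit of AttnGAN's attentional
    generative network: each image subregion attends over all words.

    region_feats: (B, R, D)  hidden features for R image subregions
    word_embs:    (B, T, D)  word features from a text encoder, same dim D
    returns:      (B, R, D)  one word-context vector per subregion
    """
    scores = torch.bmm(region_feats, word_embs.transpose(1, 2))  # (B, R, T)
    attn = F.softmax(scores, dim=-1)      # each region distributes attention over words
    context = torch.bmm(attn, word_embs)  # (B, R, D) weighted sum of word features
    return context
```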
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks
Over the last decade, Convolutional Neural Network (CNN) models have been
highly successful in solving complex vision problems. However, these deep
models are perceived as "black box" methods considering the lack of
understanding of their internal functioning. There has been a significant
recent interest in developing explainable deep learning models, and this paper
is an effort in this direction. Building on a recently proposed method called
Grad-CAM, we propose a generalized method called Grad-CAM++ that can provide
better visual explanations of CNN model predictions, in terms of better object
localization as well as explaining occurrences of multiple object instances in
a single image, when compared to the state of the art. We provide a mathematical
derivation for the proposed method, which uses a weighted combination of the
positive partial derivatives of the last convolutional layer feature maps with
respect to a specific class score as weights to generate a visual explanation
for the corresponding class label. Our extensive experiments and evaluations,
both subjective and objective, on standard datasets showed that Grad-CAM++
provides promising human-interpretable visual explanations for a given CNN
architecture across multiple tasks including classification, image caption
generation and 3D action recognition; as well as in new settings such as
knowledge distillation.
Comment: 17 pages, 15 figures, 11 tables. Accepted in the proceedings of the
IEEE Winter Conf. on Applications of Computer Vision (WACV2018). An extended
version is under review at IEEE Transactions on Pattern Analysis and Machine
Intelligence.
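The weight derivation summarised above has a widely used closed form; the following PyTorch sketch computes a Grad-CAM++ map under the common assumption that the class score is the exponential of the raw logit, so the second and third partial derivatives reduce to powers of the first gradient. The function name and tensor handling are illustrative, not the authors' reference code.

```python
import torch
import torch.nn.functional as F

def grad_cam_plus_plus(feature_maps, class_score):
    """Minimal Grad-CAM++ sketch. Assumes `feature_maps` (1, K, H, W) are the
    last conv-layer activations kept in the autograd graph (e.g. via a hook),
    and `class_score` is the pre-softmax scalar for the target class."""
    grads = torch.autograd.grad(class_score, feature_maps,
                                retain_graph=True)[0]      # (1, K, H, W)
    grads2, grads3 = grads ** 2, grads ** 3
    # alpha_ij^k = g^2 / (2 g^2 + sum_ab A_ab^k * g^3), per-pixel weights,
    # using the exp-of-logit approximation for higher-order derivatives
    sum_feats = feature_maps.sum(dim=(2, 3), keepdim=True)  # (1, K, 1, 1)
    denom = 2 * grads2 + sum_feats * grads3
    alpha = grads2 / torch.where(denom != 0, denom, torch.ones_like(denom))
    # channel weights: weighted combination of positive partial derivatives
    weights = (alpha * F.relu(grads)).sum(dim=(2, 3), keepdim=True)
    # class-specific visual explanation over the feature-map grid
    cam = F.relu((weights * feature_maps).sum(dim=1))       # (1, H, W)
    return cam
```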