Human Attention in Image Captioning: Dataset and Analysis
In this work, we present a novel dataset consisting of eye movements and
verbal descriptions recorded synchronously over images. Using this data, we
study the differences in human attention during free-viewing and image
captioning tasks. We look into the relationship between human attention and
language constructs during perception and sentence articulation. We also
analyse attention deployment mechanisms in the top-down soft attention approach
that is argued to mimic human attention in captioning tasks, and investigate
whether visual saliency can help image captioning. Our study reveals that (1)
human attention behaviour differs between free-viewing and image description
tasks: humans tend to fixate on a greater variety of regions in the latter task;
(2) there is a strong relationship between described objects and attended
objects ( of the described objects are attended); (3) a
convolutional neural network used as a feature encoder accounts for
human-attended regions during image captioning to a great extent (around );
(4) the soft-attention mechanism differs from human attention, both spatially
and temporally, and there is low correlation between caption scores and
attention consistency scores, indicating a large gap between humans and
machines with regard to top-down attention; and (5) by integrating the soft
attention model
with image saliency, we can significantly improve the model's performance on
Flickr30k and MSCOCO benchmarks. The dataset can be found at:
https://github.com/SenHe/Human-Attention-in-Image-Captioning
Comment: To appear at ICCV 2019
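The combination of top-down soft attention with an image-saliency prior mentioned in point (5) can be illustrated with a minimal, hypothetical PyTorch sketch. This is not the authors' implementation: the module names, dimensions, and the multiplicative-prior fusion of saliency with the attention weights are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Minimal top-down soft attention over CNN feature maps (illustrative sketch)."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden, saliency=None):
        # features: (B, C, H, W) CNN encoder output; hidden: (B, hidden_dim) decoder state
        b, c, h, w = features.shape
        feats = features.view(b, c, h * w).transpose(1, 2)            # (B, HW, C)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                                # (B, HW)
        alpha = F.softmax(e, dim=-1)
        if saliency is not None:
            # saliency: (B, 1, H, W) map used here as a multiplicative prior (assumption)
            prior = saliency.view(b, -1)
            alpha = alpha * prior
            alpha = alpha / (alpha.sum(dim=-1, keepdim=True) + 1e-8)  # renormalise
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)            # (B, C) attended context
        return context, alpha
```

At each decoding step the returned context vector would feed the caption decoder; the attention map alpha is what one would compare against human fixation maps when measuring attention consistency.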
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continuous video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
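As a rough illustration of how appearance, motion, gaze, and spatio-temporal context could be fused with a language query to score candidate objects, here is a hypothetical PyTorch sketch. The stream names, dimensions, additive fusion, and MLP scorer are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ObjectReferringScorer(nn.Module):
    """Illustrative fusion of per-candidate appearance, motion, gaze, and context cues."""

    def __init__(self, vocab_size, embed_dim=300, lang_dim=512, cue_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_enc = nn.GRU(embed_dim, lang_dim, batch_first=True)
        # one linear projection per visual cue stream (assumed streams)
        self.cue_proj = nn.ModuleDict({
            name: nn.Linear(cue_dim, lang_dim)
            for name in ("appearance", "motion", "gaze", "context")
        })
        self.scorer = nn.Sequential(
            nn.Linear(lang_dim * 2, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, tokens, cues):
        # tokens: (B, T) word ids of the referring expression
        # cues: dict of stream name -> (B, N, cue_dim) features for N candidate objects
        _, h = self.lang_enc(self.embed(tokens))                      # h: (1, B, lang_dim)
        query = h.squeeze(0)                                          # (B, lang_dim)
        fused = sum(self.cue_proj[k](v) for k, v in cues.items())    # (B, N, lang_dim)
        query_exp = query.unsqueeze(1).expand_as(fused)
        scores = self.scorer(torch.cat([fused, query_exp], dim=-1)).squeeze(-1)
        return scores                                                 # (B, N) per-candidate scores
```

Localization then amounts to picking the candidate with the highest score for the given expression; gaze enters simply as one more per-candidate feature stream in this sketch.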
Gaze Embeddings for Zero-Shot Image Classification
Zero-shot image classification using auxiliary information, such as
attributes describing discriminative object properties, requires time-consuming
annotation by domain experts. We instead propose a method that relies on human
gaze as auxiliary information, exploiting the fact that even non-expert users have a
natural ability to judge class membership. We present a data collection
paradigm that involves a discrimination task to increase the information
content obtained from gaze data. Our method extracts discriminative descriptors
from the data and learns a compatibility function between image and gaze using
three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid
(GFG) and Gaze Features with Sequence (GFS). We introduce two new
gaze-annotated datasets for fine-grained image classification and show that
human gaze data is indeed class discriminative, provides a competitive
alternative to expert-annotated attributes, and outperforms other baselines for
zero-shot image classification.
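To make the gaze-embedding idea concrete, below is a minimal NumPy sketch of a Gaze Histogram (GH)-style descriptor and a bilinear compatibility function between image features and class-level gaze embeddings. The grid size, normalisation, and the bilinear form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gaze_histogram(fixations, grid=(8, 8)):
    """Build a simple gaze-histogram descriptor from fixation points.

    fixations: array of shape (num_fixations, 2) with (x, y) positions
    normalised to [0, 1]. Counts per grid cell are L1-normalised.
    """
    hist, _, _ = np.histogram2d(
        fixations[:, 0], fixations[:, 1],
        bins=grid, range=[[0, 1], [0, 1]]
    )
    hist = hist.flatten()
    return hist / max(hist.sum(), 1e-8)

def compatibility(image_feat, class_gaze_embeddings, W):
    """Bilinear compatibility x^T W g_y between an image feature and
    per-class gaze embeddings; returns the index of the best-scoring class."""
    scores = image_feat @ W @ class_gaze_embeddings.T   # (num_classes,)
    return int(np.argmax(scores))
```

Given per-class gaze embeddings (e.g., averaged gaze histograms from training observers), zero-shot classification of an image from an unseen class reduces to picking the class whose gaze embedding has the highest compatibility score with the image feature.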