LEARNING SALIENCY FOR HUMAN ACTION RECOGNITION
When we look at a visual stimulus, certain areas stand out from their
neighbouring areas and immediately grab our attention. A map that identifies
such areas is called a visual saliency map. Since humans can easily recognize
actions when watching videos, having saliency maps available might be
beneficial for a fully automated action recognition system. In this thesis we
look into ways of learning to predict visual saliency and how to use the
learned saliency for action recognition.
In the first phase, as opposed to approaches that use manually designed
features for saliency prediction, we propose a few multilayer architectures
for learning saliency features. First, we learn the first-layer features of a
two-layer architecture using an unsupervised learning algorithm. Second, we
learn the second-layer features of a two-layer architecture using supervision
from recorded human gaze fixations. Third, we use a deep architecture that
learns features at all layers using only supervision from recorded human gaze
fixations.
We show that the saliency prediction results we obtain are better than those
obtained by approaches that use manually designed features. We also show that
applying supervision at higher layers yields better saliency prediction
results, i.e. the second approach outperforms the first, and the third
outperforms the second.
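As a minimal sketch of the third, fully supervised setting described above, one
could train a small convolutional network to regress a saliency map from a
frame, with ground-truth maps built from recorded gaze fixations. The layer
sizes, loss, and all names below are illustrative assumptions, not the thesis
implementation:

```python
# Sketch only: a small convolutional saliency regressor supervised by
# gaze-derived ground-truth maps (hypothetical layer sizes and loss).
import torch
import torch.nn as nn

class SaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # learned feature layers followed by a 1x1 saliency read-out
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.readout = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, frame):
        return torch.sigmoid(self.readout(self.features(frame)))

model = SaliencyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

frames = torch.rand(4, 3, 128, 128)         # stand-in input frames
fixation_maps = torch.rand(4, 1, 128, 128)  # stand-in gaze-derived targets

optimizer.zero_grad()
loss = loss_fn(model(frames), fixation_maps)
loss.backward()
optimizer.step()
```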
In the second phase we focus on how saliency can be used to localize the areas
that are used for action classification. In contrast to manually designed
action features, such as HOG/HOF, we learn the features using a fully
supervised deep learning architecture. We show that our features, combined
with the predicted saliency from the first phase, outperform manually designed
features. We further develop an SVM framework that uses the predicted saliency
and the learned action features to both localize (in terms of bounding boxes)
and classify the actions. The saliency prediction serves as an additional cost
in the SVM training and testing procedures when inferring the bounding box
locations. We show that adding the saliency cost yields better action
recognition results than leaving it out, and the improvement is larger when
the cost is added in both training and testing rather than in testing alone.
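One way to read the saliency-augmented inference described above is as a
box-scoring rule that adds a saliency term to the linear SVM score. The sketch
below is a hypothetical illustration; the weighting `lam`, the definition of
`saliency_cost`, and all names are assumptions rather than the thesis
formulation:

```python
# Sketch only: pick the bounding box maximising an SVM score plus a
# saliency term; `lam` and `saliency_cost` are hypothetical choices.
import numpy as np

def saliency_cost(saliency_map, box):
    """Fraction of predicted saliency mass captured by the box."""
    x0, y0, x1, y1 = box
    return saliency_map[y0:y1, x0:x1].sum() / (saliency_map.sum() + 1e-8)

def best_box(w, box_features, boxes, saliency_map, lam=1.0):
    """Infer the box with the highest combined score."""
    scores = [np.dot(w, f) + lam * saliency_cost(saliency_map, b)
              for f, b in zip(box_features, boxes)]
    return boxes[int(np.argmax(scores))]

# toy usage with random features and two candidate boxes
saliency_map = np.random.rand(120, 160)
boxes = [(10, 10, 60, 60), (80, 40, 150, 110)]
box_features = [np.random.rand(32) for _ in boxes]
w = np.random.rand(32)
print(best_box(w, box_features, boxes, saliency_map))
```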
Digging Deeper into Egocentric Gaze Prediction
This paper digs deeper into factors that influence egocentric gaze. Instead
of training deep models for this purpose in a blind manner, we propose to
inspect factors that contribute to gaze guidance during daily tasks. Bottom-up
saliency and optical flow are assessed versus strong spatial prior baselines.
Task-specific cues such as vanishing point, manipulation point, and hand
regions are analyzed as representatives of top-down information. We also look
into the contribution of these factors by investigating a simple recurrent
neural model for egocentric gaze prediction. First, deep features are
extracted for all input video frames. Then, a gated recurrent unit is employed
to integrate information over time and to predict the next fixation. We also
propose an integrated model that combines the recurrent model with several
top-down and bottom-up cues. Extensive experiments over multiple datasets
reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up
saliency models perform poorly in predicting gaze and underperform spatial
biases, (3) deep features perform better compared to traditional features, (4)
as opposed to hand regions, the manipulation point is a strongly influential
cue for gaze prediction, (5) combining the proposed recurrent model with
bottom-up cues, vanishing points and, in particular, the manipulation point
results in the
best gaze prediction accuracy over egocentric videos, (6) the knowledge
transfer works best for cases where the tasks or sequences are similar, and (7)
task and activity recognition can benefit from gaze prediction. Our findings
suggest that (1) there should be more emphasis on hand-object interaction and
(2) the egocentric vision community should consider larger datasets including
diverse stimuli and more subjects.
Comment: presented at WACV 201
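As a hedged illustration of the recurrent model described in this abstract
(deep per-frame features fed to a gated recurrent unit that predicts the next
fixation), the following sketch shows one plausible arrangement; the feature
dimensions, the output parameterisation as normalised (x, y) coordinates, and
all names are assumptions, not the authors' code:

```python
# Sketch only: a GRU integrates per-frame deep features over time and
# regresses the next fixation as normalised (x, y) coordinates.
import torch
import torch.nn as nn

class GazeGRU(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # (x, y) of the next fixation

    def forward(self, feats):
        # feats: (batch, time, feat_dim) features from a frozen CNN backbone
        out, _ = self.gru(feats)
        return torch.sigmoid(self.head(out[:, -1]))  # prediction at the last step

feats = torch.rand(2, 16, 512)    # stand-in features for 16 frames
next_fixation = GazeGRU()(feats)  # shape (2, 2)
```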
Human Attention in Image Captioning: Dataset and Analysis
In this work, we present a novel dataset consisting of eye movements and
verbal descriptions recorded synchronously over images. Using this data, we
study the differences in human attention during free-viewing and image
captioning tasks. We look into the relationship between human attention and
language constructs during perception and sentence articulation. We also
analyse attention deployment mechanisms in the top-down soft attention approach
that is argued to mimic human attention in captioning tasks, and investigate
whether visual saliency can help image captioning. Our study reveals that (1)
human attention behaviour differs between free-viewing and image description
tasks: humans tend to fixate on a greater variety of regions in the latter task,
(2) there is a strong relationship between described objects and attended
objects ( of the described objects are being attended), (3) a
convolutional neural network as feature encoder accounts for human-attended
regions during image captioning to a great extent (around ), (4)
the soft-attention mechanism differs from human attention, both spatially and
temporally, and there is low correlation between caption scores and attention
consistency scores; these findings indicate a large gap between humans and
machines with regard to top-down attention, and (5) by integrating the soft attention model
with image saliency, we can significantly improve the model's performance on
Flickr30k and MSCOCO benchmarks. The dataset can be found at:
https://github.com/SenHe/Human-Attention-in-Image-Captioning.
Comment: To appear at ICCV 201
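A possible reading of finding (5), integrating the soft attention model with
image saliency, is to re-weight the decoder's soft-attention weights by a
predicted saliency map over the same grid of regions before pooling the
encoder features. The sketch below is an assumption about one such
combination, not the authors' method; all names are hypothetical:

```python
# Sketch only: modulate soft-attention weights with a saliency map over the
# same grid of image regions before pooling the encoder features.
import torch

def saliency_modulated_attention(attn_weights, saliency, features):
    """
    attn_weights: (batch, regions)      decoder's soft-attention weights
    saliency:     (batch, regions)      predicted saliency per region
    features:     (batch, regions, dim) CNN encoder features
    """
    combined = attn_weights * saliency
    combined = combined / (combined.sum(dim=1, keepdim=True) + 1e-8)
    context = (combined.unsqueeze(-1) * features).sum(dim=1)  # (batch, dim)
    return context, combined

attn = torch.softmax(torch.rand(2, 49), dim=1)  # e.g. a 7x7 feature grid
sal = torch.rand(2, 49)
feats = torch.rand(2, 49, 512)
context, weights = saliency_modulated_attention(attn, sal, feats)
```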