Digging Deeper into Egocentric Gaze Prediction
This paper digs deeper into the factors that influence egocentric gaze. Instead
of training deep models for this purpose blindly, we propose to inspect the
factors that contribute to gaze guidance during daily tasks. Bottom-up saliency
and optical flow are assessed against strong spatial prior baselines.
Task-specific cues such as the vanishing point, manipulation point, and hand
regions are analyzed as representatives of top-down information. We also look
into the contribution of these factors by investigating a simple recurrent
neural model for egocentric gaze prediction. First, deep features are
extracted for all input video frames. Then, a gated recurrent unit is employed
to integrate information over time and to predict the next fixation. We also
propose an integrated model that combines the recurrent model with several
top-down and bottom-up cues. Extensive experiments over multiple datasets
reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up
saliency models perform poorly in predicting gaze and underperform spatial
biases, (3) deep features perform better compared to traditional features, (4)
as opposed to hand regions, the manipulation point is a strongly influential cue
for gaze prediction, (5) combining the proposed recurrent model with bottom-up
cues, vanishing points, and, in particular, the manipulation point yields the
best gaze prediction accuracy over egocentric videos, (6) knowledge transfer
works best when the tasks or sequences are similar, and (7)
task and activity recognition can benefit from gaze prediction. Our findings
suggest that (1) there should be more emphasis on hand-object interaction and
(2) the egocentric vision community should consider larger datasets including
diverse stimuli and more subjects.
Comment: Presented at WACV 2019
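The recurrent model sketched in this abstract (per-frame deep features fed to a gated recurrent unit that predicts the next fixation) maps to a compact sequence model. Below is a minimal PyTorch sketch of that idea; the class name GazeGRU, the ResNet-18 feature trunk, the hidden size, and the choice to regress normalized (x, y) fixation coordinates are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: per-frame deep features -> GRU -> next fixation.
# Illustrative only; not the authors' code or architecture.
import torch
import torch.nn as nn
import torchvision.models as models


class GazeGRU(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        # Stand-in "deep feature" extractor (use pretrained weights in practice).
        backbone = models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # -> (B*T, 512, 1, 1)
        # GRU integrates per-frame features over time.
        self.gru = nn.GRU(input_size=512, hidden_size=hidden_dim, batch_first=True)
        # Head regresses the next fixation as normalized (x, y) in [0, 1].
        self.head = nn.Sequential(nn.Linear(hidden_dim, 2), nn.Sigmoid())

    def forward(self, frames):                                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.features(frames.flatten(0, 1)).flatten(1)  # (B*T, 512)
        feats = feats.view(b, t, -1)                             # (B, T, 512)
        out, _ = self.gru(feats)                                  # (B, T, hidden_dim)
        return self.head(out[:, -1])                              # (B, 2) next fixation


if __name__ == "__main__":
    model = GazeGRU()
    clip = torch.randn(2, 8, 3, 224, 224)   # two clips of 8 frames each
    print(model(clip).shape)                  # torch.Size([2, 2])
```

In the paper's integrated variant, the output of such a recurrent branch would be combined with bottom-up and top-down cue maps (saliency, vanishing point, manipulation point) rather than used alone.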
Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions
In this work, we address two coupled tasks of gaze prediction and action
recognition in egocentric videos by exploring their mutual context. Our
assumption is that, while performing a manipulation task, what a person is
doing determines where the person is looking, and the gaze point reveals gaze
and non-gaze regions that contain important and complementary information
about the ongoing action. We propose a novel mutual context
network (MCN) that jointly learns action-dependent gaze prediction and
gaze-guided action recognition in an end-to-end manner. Experiments on public
egocentric video datasets demonstrate that our MCN achieves state-of-the-art
performance on both gaze prediction and action recognition.
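The abstract describes a joint, end-to-end formulation in which gaze prediction and action recognition inform each other. The sketch below illustrates one side of that coupling, using a predicted gaze map as spatial attention for gaze-guided action classification. It is not the authors' MCN; the layer shapes, the pooling scheme, and the omission of the action-to-gaze conditioning are simplifying assumptions.

```python
# Minimal sketch of gaze-guided action recognition trained jointly with a gaze
# head. Illustrative only; not the authors' MCN architecture.
import torch
import torch.nn as nn


class JointGazeAction(nn.Module):
    def __init__(self, in_ch=512, num_actions=20):
        super().__init__()
        # Gaze head: per-location logits over the feature map -> softmaxed gaze map.
        self.gaze_head = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        # Action head: classify from gaze-weighted plus globally averaged features.
        self.action_head = nn.Linear(2 * in_ch, num_actions)

    def forward(self, feat):                                   # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        gaze_logits = self.gaze_head(feat)                      # (B, 1, H, W)
        gaze = torch.softmax(gaze_logits.flatten(2), dim=-1)    # spatial distribution
        gaze = gaze.view(b, 1, h, w)
        gaze_pooled = (feat * gaze).sum(dim=(2, 3))             # gaze-guided pooling
        global_pooled = feat.mean(dim=(2, 3))                   # non-gaze context
        logits = self.action_head(torch.cat([gaze_pooled, global_pooled], dim=1))
        return gaze, logits                                     # joint outputs


if __name__ == "__main__":
    feat = torch.randn(2, 512, 14, 14)            # e.g., backbone features of a frame
    gaze_map, action_logits = JointGazeAction()(feat)
    print(gaze_map.shape, action_logits.shape)    # (2, 1, 14, 14) (2, 20)
```

Training both heads with a shared backbone and a combined loss is what makes the two tasks mutually informative in this style of model.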