Learning to Predict Gaze in Egocentric Video
We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in the camera wearer's behaviors. Specifically, we compute the camera wearer's head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging our gaze predictions into state-of-the-art methods.
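The cue-combination idea described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's model: the weights, the center prior, and the blending rule are all assumptions made for the example.

```python
def predict_gaze(head_motion, hand_pos, frame_center=(0.5, 0.5),
                 w_motion=0.3, w_hand=0.4):
    """Blend head-motion and hand-location cues into a normalized
    (x, y) gaze estimate.

    head_motion: (dx, dy) camera/head displacement between frames.
    hand_pos:    (x, y) detected hand location, or None if no hand visible.
    All weights and the blending rule are illustrative assumptions,
    not values from the paper.
    """
    # Start from a center prior: egocentric gaze is strongly center-biased.
    gx, gy = frame_center
    # Head motion shifts the expected gaze in the direction the wearer turns.
    gx += w_motion * head_motion[0]
    gy += w_motion * head_motion[1]
    # Pull the estimate toward the manipulating hand when one is visible.
    if hand_pos is not None:
        gx += w_hand * (hand_pos[0] - gx)
        gy += w_hand * (hand_pos[1] - gy)
    # Clamp to the image.
    return min(max(gx, 0.0), 1.0), min(max(gy, 0.0), 1.0)
```

With no motion and no visible hand the estimate falls back to the center prior; a visible hand pulls the estimate toward it.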
Digging Deeper into Egocentric Gaze Prediction
This paper digs deeper into factors that influence egocentric gaze. Instead
of training deep models for this purpose in a blind manner, we propose to
inspect factors that contribute to gaze guidance during daily tasks. Bottom-up
saliency and optical flow are assessed versus strong spatial prior baselines.
Task-specific cues such as vanishing point, manipulation point, and hand
regions are analyzed as representatives of top-down information. We also look
into the contribution of these factors by investigating a simple recurrent
neural model for egocentric gaze prediction. First, deep features are
extracted for all input video frames. Then, a gated recurrent unit is employed
to integrate information over time and to predict the next fixation. We also
propose an integrated model that combines the recurrent model with several
top-down and bottom-up cues. Extensive experiments over multiple datasets
reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up
saliency models perform poorly in predicting gaze and underperform spatial
biases, (3) deep features perform better compared to traditional features, (4)
as opposed to hand regions, the manipulation point is a strong influential cue
for gaze prediction, (5) combining the proposed recurrent model with bottom-up
cues, vanishing points, and, in particular, the manipulation point yields the
best gaze prediction accuracy over egocentric videos, (6) the knowledge
transfer works best for cases where the tasks or sequences are similar, and (7)
task and activity recognition can benefit from gaze prediction. Our findings
suggest that (1) there should be more emphasis on hand-object interaction and
(2) the egocentric vision community should consider larger datasets including
diverse stimuli and more subjects.
Comment: presented at WACV 201
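The recurrent model described above (per-frame deep features integrated by a gated recurrent unit, then read out as the next fixation) can be sketched with a minimal scalar GRU. Everything here is an illustrative assumption: the scalar dimensions, the weight names, and the readout function are not from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, W):
    """One GRU update with scalar state and input (illustrative dimensions).

    W maps the six weight names (Wz, Uz, Wr, Ur, Wh, Uh) to scalars;
    biases are omitted for brevity.
    """
    z = sigmoid(W["Wz"] * x + W["Uz"] * h)                 # update gate
    r = sigmoid(W["Wr"] * x + W["Ur"] * h)                 # reset gate
    h_cand = math.tanh(W["Wh"] * x + W["Uh"] * (r * h))    # candidate state
    return (1.0 - z) * h + z * h_cand

def predict_fixation(features, W):
    """Integrate per-frame features over time, then read out the next
    fixation as a normalized (x, y) point. The readout is a placeholder."""
    h = 0.0
    for x in features:
        h = gru_step(h, x, W)
    # Hypothetical readout: squash the state into image coordinates.
    return 0.5 + 0.5 * math.tanh(h), 0.5
```

In practice the state and features would be high-dimensional vectors and the readout a learned layer; the sketch only shows the gating and temporal integration.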
Multitask Learning to Improve Egocentric Action Recognition
In this work we employ multitask learning to capitalize on the structure that
exists in related supervised tasks to train complex neural networks. It allows
training a network for multiple objectives in parallel, in order to improve
performance on at least one of them by capitalizing on a shared representation
that is developed to accommodate more information than it otherwise would for a
single task. We employ this idea to tackle action recognition in egocentric
videos by introducing additional supervised tasks. We learn the verbs
and nouns of which action labels consist and predict coordinates
that capture the hand locations and the gaze-based visual saliency for all the
frames of the input video segments. This forces the network to explicitly focus
on cues from secondary tasks that it might otherwise have missed, resulting in
improved inference. Our experiments on EPIC-Kitchens and EGTEA Gaze+ show
consistent improvements when training with multiple tasks over the single-task
baseline. Furthermore, in EGTEA Gaze+ we outperform the state-of-the-art in
action recognition by 3.84%. Apart from actions, our method produces accurate
hand and gaze estimations as side tasks, without requiring any additional input
at test time other than the RGB video clips.
Comment: 10 pages, 3 figures, accepted at the 5th Egocentric Perception,
Interaction and Computing (EPIC) workshop at ICCV 2019, code repository:
https://github.com/georkap/hand_track_classificatio
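The multitask setup described above, where a shared network is trained on verb, noun, hand, and gaze objectives in parallel, amounts to minimizing a weighted sum of per-task losses and composing the action label from the verb and noun heads. A minimal sketch, with hypothetical task names and weights not taken from the paper:

```python
def multitask_loss(losses, weights):
    """Weighted sum of per-task losses.

    losses/weights map task names (e.g. 'verb', 'noun', 'hand', 'gaze')
    to values; the weights are hypothetical hyperparameters, not the
    paper's values.
    """
    return sum(weights[task] * loss for task, loss in losses.items())

def action_from_verb_noun(verb_scores, noun_scores):
    """Compose an action label from independent verb and noun predictions
    by taking the argmax of each head's scores."""
    verb = max(range(len(verb_scores)), key=verb_scores.__getitem__)
    noun = max(range(len(noun_scores)), key=noun_scores.__getitem__)
    return verb, noun
```

At training time the combined loss is backpropagated through the shared representation; at test time only the verb and noun heads are needed for action recognition, with hand and gaze available as free side outputs.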
Going Deeper into First-Person Activity Recognition
We bring together ideas from recent work on feature design for egocentric
action recognition under one framework by exploring the use of deep
convolutional neural networks (CNN). Recent work has shown that features such
as hand appearance, object attributes, local hand motion and camera ego-motion
are important for characterizing first-person actions. To integrate these ideas
under one framework, we propose a twin stream network architecture, where one
stream analyzes appearance information and the other stream analyzes motion
information. Our appearance stream encodes prior knowledge of the egocentric
paradigm by explicitly training the network to segment hands and localize
objects. By visualizing certain neuron activations of our network, we show that
our proposed architecture naturally learns features that capture object
attributes and hand-object configurations. Our extensive experiments on
benchmark egocentric action datasets show that our deep architecture enables
recognition rates that significantly outperform state-of-the-art techniques --
an average increase in accuracy over all datasets. Furthermore, by
learning to recognize objects, actions and activities jointly, the performance
of individual recognition tasks also increases by (actions) and
(objects). We also include the results of extensive ablative analysis to
highlight the importance of network design decisions.
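The twin-stream idea above, one stream for appearance and one for motion, is typically combined by late fusion of the per-class scores. A minimal sketch; the weighted-average rule and the fusion weight are illustrative assumptions, not the paper's fusion scheme:

```python
def fuse_two_stream(appearance_scores, motion_scores, alpha=0.5):
    """Late-fuse per-class scores from the appearance and motion streams
    by a weighted average. alpha is an illustrative fusion weight."""
    return [alpha * a + (1.0 - alpha) * m
            for a, m in zip(appearance_scores, motion_scores)]
```

The fused vector is then argmaxed to produce the final action label; alpha can be tuned on a validation set to weight whichever stream is more reliable.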