Learning Action Maps of Large Environments via First-Person Vision
When people observe and interact with physical spaces, they are able to
associate functionality with regions in the environment. Our goal is to automate
dense functional understanding of large spaces by leveraging sparse activity
demonstrations recorded from an ego-centric viewpoint. The method we describe
enables functionality estimation in large scenes where people have behaved, as
well as novel scenes where no behaviors are observed. Our method learns and
predicts "Action Maps", which encode the ability for a user to perform
activities at various locations. By using an egocentric camera to observe human
activities, our method scales with the size of the scene without requiring
multiple mounted static surveillance cameras, and it is well suited to
observing activities up close. We demonstrate that by capturing
appearance-based attributes of the environment and associating these attributes
with activity demonstrations, our proposed mathematical framework allows for
the prediction of Action Maps in new environments. Additionally, we offer a
preliminary glimpse of the applicability of Action Maps by demonstrating a
proof-of-concept application in which they are used in concert with activity
detections to perform localization. Comment: To appear at CVPR 2016
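The abstract describes associating appearance-based attributes of scene locations with sparse activity demonstrations in order to predict a dense Action Map. A minimal sketch of that idea follows, assuming a simple per-location classifier rather than the paper's actual mathematical framework; the grid cells, feature dimensions, and choice of LogisticRegression are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Appearance descriptors for every grid cell of a scene (stand-ins for
# pooled image features of the patch covering each cell).
n_cells, feat_dim = 500, 64
scene_features = rng.normal(size=(n_cells, feat_dim))

# Sparse demonstrations: the few cells where an activity (e.g., "sit")
# was actually observed from the egocentric video.
demo_cells = rng.choice(n_cells, size=20, replace=False)
negative_cells = np.setdiff1d(np.arange(n_cells), demo_cells)[:40]
train_cells = np.concatenate([demo_cells, negative_cells])
train_labels = np.concatenate([np.ones(20, dtype=int), np.zeros(40, dtype=int)])

# Associate appearance attributes with the sparse activity demonstrations.
clf = LogisticRegression(max_iter=1000).fit(scene_features[train_cells], train_labels)

# Dense "Action Map": an action score for every cell, including cells
# (or entire scenes) where no behavior was observed.
action_map = clf.predict_proba(scene_features)[:, 1]
print(action_map.shape)  # (500,) -- one score per scene location

Because the classifier depends only on appearance features, the same model can be applied to a novel scene where no demonstrations exist, which is the generalization behavior the abstract emphasizes.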
Going Deeper into First-Person Activity Recognition
We bring together ideas from recent work on feature design for egocentric
action recognition under one framework by exploring the use of deep
convolutional neural networks (CNNs). Recent work has shown that features such
as hand appearance, object attributes, local hand motion and camera ego-motion
are important for characterizing first-person actions. To integrate these ideas
under one framework, we propose a twin stream network architecture, where one
stream analyzes appearance information and the other stream analyzes motion
information. Our appearance stream encodes prior knowledge of the egocentric
paradigm by explicitly training the network to segment hands and localize
objects. By visualizing certain neuron activations of our network, we show that
our proposed architecture naturally learns features that capture object
attributes and hand-object configurations. Our extensive experiments on
benchmark egocentric action datasets show that our deep architecture enables
recognition rates that significantly outperform state-of-the-art techniques,
with an average increase in accuracy across all datasets. Furthermore, by
learning to recognize objects, actions and activities jointly, the performance
of the individual recognition tasks (actions and objects) also increases.
We also include the results of extensive ablative analysis to
highlight the importance of network design decisions.
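The twin stream design described above can be sketched as a small late-fusion network: one stream over the RGB frame (appearance) and one over stacked optical flow (motion), fused before classification. This is a minimal illustration only; the layer sizes, flow stacking depth, and concatenation-based fusion are assumptions, not the paper's architecture, which additionally trains the appearance stream to segment hands and localize objects.

import torch
import torch.nn as nn

class Stream(nn.Module):
    """A small convolutional feature extractor shared by both streams."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 64)

class TwoStreamNet(nn.Module):
    def __init__(self, num_actions, flow_stack=10):
        super().__init__()
        self.appearance = Stream(in_channels=3)            # RGB frame
        self.motion = Stream(in_channels=2 * flow_stack)   # stacked x/y optical flow
        self.classifier = nn.Linear(64 + 64, num_actions)  # late fusion by concatenation

    def forward(self, rgb, flow):
        fused = torch.cat([self.appearance(rgb), self.motion(flow)], dim=1)
        return self.classifier(fused)

# Example forward pass with dummy inputs.
net = TwoStreamNet(num_actions=20)
rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 20, 224, 224)
print(net(rgb, flow).shape)  # torch.Size([4, 20])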