Egocentric Action Understanding by Learning Embodied Attention

Abstract

Videos captured from wearable cameras, known as egocentric videos, provide a continuous record of human daily visual experience and thereby offer a new perspective on human activity understanding. Importantly, egocentric video aligns gaze, embodied movement, and action in the same "first-person" coordinate system. These rich egocentric cues reflect the attended scene context of an action and thus provide novel means for reasoning about human daily routines. In my thesis work, I describe my efforts to develop novel computational models that learn embodied egocentric attention for the automatic analysis of egocentric actions. First, I introduce a probabilistic model for learning gaze and actions in egocentric video, and further demonstrate that attention can serve as a robust tool for learning motion-aware video representations. Second, I develop a novel deep model to address the challenging problem of jointly recognizing and localizing the actions of a mobile user on a known 3D map from egocentric videos. Third, I present a novel deep latent variable model that makes use of intentional human body movement (motor attention) as a key representation for forecasting human-object interaction in egocentric video. Finally, I propose the novel task of future hand segmentation from egocentric videos, and show how explicitly modeling future head motion can facilitate the forecasting of future hand movements.
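To make the attention mechanism behind the first contribution concrete, the following is a minimal sketch, not the thesis model itself: it shows how a predicted gaze/attention map can spatially weight per-frame video features before pooling them into a clip-level representation. The function name, tensor shapes, and PyTorch usage are all illustrative assumptions.

    # Minimal sketch: attention-weighted pooling of video features.
    # A predicted gaze/attention map re-weights spatial locations per
    # frame, so the clip representation emphasizes the attended region.
    import torch
    import torch.nn.functional as F

    def attention_weighted_pooling(features, attention_logits):
        """features: (T, C, H, W) per-frame feature maps.
        attention_logits: (T, 1, H, W) unnormalized gaze/attention scores.
        Returns a (C,) clip-level feature vector."""
        T, C, H, W = features.shape
        # Normalize attention into a spatial probability map per frame.
        attn = F.softmax(attention_logits.view(T, 1, H * W), dim=-1)
        attn = attn.view(T, 1, H, W)
        # Weight features by attention and sum over space: (T, C).
        pooled = (features * attn).sum(dim=(2, 3))
        # Average over time for a clip-level representation: (C,).
        return pooled.mean(dim=0)

    # Hypothetical usage with random tensors standing in for a backbone's
    # feature maps and a gaze-prediction head's outputs:
    feats = torch.randn(8, 256, 7, 7)
    gaze = torch.randn(8, 1, 7, 7)
    clip_repr = attention_weighted_pooling(feats, gaze)  # shape (256,)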
