Going Deeper into First-Person Activity Recognition
We bring together ideas from recent work on feature design for egocentric
action recognition under one framework by exploring the use of deep
convolutional neural networks (CNNs). Recent work has shown that features such
as hand appearance, object attributes, local hand motion and camera ego-motion
are important for characterizing first-person actions. To integrate these ideas
under one framework, we propose a twin stream network architecture, where one
stream analyzes appearance information and the other stream analyzes motion
information. Our appearance stream encodes prior knowledge of the egocentric
paradigm by explicitly training the network to segment hands and localize
objects. By visualizing certain neuron activations of our network, we show that
our proposed architecture naturally learns features that capture object
attributes and hand-object configurations. Our extensive experiments on
benchmark egocentric action datasets show that our deep architecture enables
recognition rates that significantly outperform state-of-the-art techniques,
improving accuracy on average across all datasets. Furthermore, by learning to
recognize objects, actions and activities jointly, the performance of the
individual action and object recognition tasks also increases. We also include
the results of an extensive ablative analysis to
highlight the importance of network design decisions.
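As a rough illustration of the two-stream design described above, the sketch
below pairs an appearance stream over RGB frames with a motion stream over
stacked optical flow and fuses them for action classification. It is not the
authors' implementation; the ResNet-18 backbones, the 20-channel flow stack,
the late-fusion classifier, and the class count are all assumptions.

```python
# Minimal two-stream sketch for egocentric action recognition (illustrative,
# not the paper's architecture or training procedure).
import torch
import torch.nn as nn
from torchvision import models

class TwinStreamNet(nn.Module):
    def __init__(self, num_actions=61):  # class count is an assumption
        super().__init__()
        # Appearance stream: RGB frames; in the paper this stream is further
        # trained to segment hands and localize objects.
        self.appearance = models.resnet18(weights=None)
        self.appearance.fc = nn.Identity()
        # Motion stream: stacked optical flow (assumed 10 flow frames x 2 = 20 channels).
        self.motion = models.resnet18(weights=None)
        self.motion.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2,
                                      padding=3, bias=False)
        self.motion.fc = nn.Identity()
        # Late fusion of the two 512-d stream features.
        self.classifier = nn.Linear(512 + 512, num_actions)

    def forward(self, rgb, flow):
        a = self.appearance(rgb)   # (B, 512) appearance features
        m = self.motion(flow)      # (B, 512) motion features
        return self.classifier(torch.cat([a, m], dim=1))

# Example: a batch of 4 RGB frames with their corresponding flow stacks.
logits = TwinStreamNet()(torch.randn(4, 3, 224, 224), torch.randn(4, 20, 224, 224))
```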
Towards Semantic Fast-Forward and Stabilized Egocentric Videos
The emergence of low-cost personal mobile devices and wearable cameras, together
with the increasing storage capacity of video-sharing websites, has driven a
growing interest in first-person videos. Since most recorded videos consist of
long-running streams of unedited content, they are tedious and unpleasant to
watch. State-of-the-art fast-forward methods face the challenge of balancing the
smoothness of the video against the emphasis on relevant frames for a given
speed-up rate. In this work, we present a methodology
capable of summarizing and stabilizing egocentric videos by extracting the
semantic information from the frames. This paper also describes a newly
collected dataset of semantically labeled videos and introduces a new smoothness
evaluation metric for egocentric videos that is used to test our method.
Comment: Accepted for publication and presented at the First International
Workshop on Egocentric Perception, Interaction and Computing at the European
Conference on Computer Vision (EPIC@ECCV) 2016.
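To make the trade-off concrete, the sketch below shows one naive way to pick
frames for a fast-forwarded video from per-frame semantic scores: segments with
high scores are sampled densely, while low-scoring segments are skipped
aggressively. The scoring source, the skip rule, and the parameters are
assumptions for illustration and do not reproduce the method described above.

```python
# Illustrative semantic-aware frame selection for fast-forwarding (a sketch,
# not the paper's method).

def select_frames(semantic_scores, speedup=10, min_skip=2):
    """Pick frame indices so that semantically rich regions are sampled densely.

    semantic_scores: per-frame scores in [0, 1] (e.g. from an object/face detector).
    speedup: desired average speed-up rate over the whole video.
    min_skip: smallest allowed gap between kept frames.
    """
    selected, i = [], 0
    while i < len(semantic_scores):
        selected.append(i)
        # High semantic score -> small skip; low score -> skip up to ~2x the target rate.
        skip = max(min_skip, round(2 * speedup * (1.0 - semantic_scores[i])))
        i += skip
    return selected

# Example: a synthetic score profile with a semantically rich middle segment.
scores = [0.1] * 100 + [0.9] * 50 + [0.1] * 100
kept = select_frames(scores, speedup=10)
print(len(kept), "of", len(scores), "frames kept")
```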
Predicting visual context for unsupervised event segmentation in continuous photo-streams
Segmenting video content into events provides semantic structures for
indexing, retrieval, and summarization. Since motion cues are not available in
continuous photo-streams, and annotations in lifelogging are scarce and costly,
the frames are usually clustered into events by comparing the visual features
between them in an unsupervised way. However, such methodologies are
ineffective at handling heterogeneous events, e.g. taking a walk, and temporary
changes in viewing direction, e.g. during a meeting. To address these
limitations, we propose Contextual Event Segmentation (CES), a novel
segmentation paradigm that uses an LSTM-based generative network to model the
photo-stream sequences, predict their visual context, and track their
evolution. CES decides whether a frame is an event boundary by comparing the
visual context generated from the frames in the past, to the visual context
predicted from the future. We implemented CES on a new and massive lifelogging
dataset consisting of more than 1.5 million images spanning over 1,723 days.
Experiments on the popular EDUB-Seg dataset show that our model outperforms the
state-of-the-art by over 16% in F-measure. Furthermore, CES's performance is
only 3 points below that of human annotators.
Comment: Accepted for publication at the 2018 ACM Multimedia Conference (MM '18).
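The core comparison behind CES can be sketched as follows: summarize a window of
past frames and a window of future frames with a recurrent encoder, and score a
frame as a likely boundary when the two summaries diverge. This is a
simplification of the generative model described above; the LSTM sizes, the
cosine-distance criterion, the window length, and the random features in the
example are assumptions.

```python
# Sketch of past-vs-future context comparison for event boundary scoring
# (illustrative; not the authors' CES model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, feats):          # feats: (B, T, feat_dim)
        _, (h, _) = self.lstm(feats)   # final hidden state summarizes the window
        return h[-1]                   # (B, hidden)

def boundary_scores(frame_feats, window=5, encoder=None):
    """Score each frame by the divergence between past and future context."""
    encoder = encoder or ContextEncoder(frame_feats.size(-1))
    scores = []
    for t in range(window, frame_feats.size(0) - window):
        past = encoder(frame_feats[t - window:t].unsqueeze(0))
        future = encoder(frame_feats[t:t + window].flip(0).unsqueeze(0))
        scores.append(1.0 - F.cosine_similarity(past, future).item())
    return scores  # higher score -> more likely an event boundary

# Example on random per-frame visual features (stand-ins for CNN embeddings).
feats = torch.randn(50, 512)
print(max(boundary_scores(feats)))
```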