5,510 research outputs found

Going Deeper into First-Person Activity Recognition

Author: Fan Haoqi
Kitani Kris M.
Ma Minghuang
Publication date: 12/05/2016

We bring together ideas from recent work on feature design for egocentric action recognition under one framework by exploring the use of deep convolutional neural networks (CNN). Recent work has shown that features such as hand appearance, object attributes, local hand motion and camera ego-motion are important for characterizing first-person actions. To integrate these ideas under one framework, we propose a twin stream network architecture, where one stream analyzes appearance information and the other stream analyzes motion information. Our appearance stream encodes prior knowledge of the egocentric paradigm by explicitly training the network to segment hands and localize objects. By visualizing certain neuron activations of our network, we show that our proposed architecture naturally learns features that capture object attributes and hand-object configurations. Our extensive experiments on benchmark egocentric action datasets show that our deep architecture enables recognition rates that significantly outperform state-of-the-art techniques, with an average 6.6% increase in accuracy over all datasets. Furthermore, by learning to recognize objects, actions and activities jointly, the performance of individual recognition tasks also increases by 30% (actions) and 14% (objects). We also include the results of extensive ablative analysis to highlight the importance of network design decisions.
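As a rough illustration of the twin stream idea in this abstract, the sketch below combines class scores from an appearance stream and a motion stream. The abstract does not say how the two streams are merged, so the weighted averaging of softmax probabilities used here is an assumption, and the names `fuse_two_streams`, `predict` and `appearance_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(appearance_logits, motion_logits, appearance_weight=0.5):
    """Late fusion by weighted averaging (an assumption, not the paper's method)."""
    p_appearance = softmax(appearance_logits)
    p_motion = softmax(motion_logits)
    w = appearance_weight
    return [w * a + (1.0 - w) * m for a, m in zip(p_appearance, p_motion)]


def predict(appearance_logits, motion_logits, labels):
    """Return the label with the highest fused probability, plus all fused scores."""
    fused = fuse_two_streams(appearance_logits, motion_logits)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best], fused


# Hypothetical scores for three made-up action classes.
labels = ["take", "pour", "stir"]
label, fused = predict([2.0, 1.0, 0.1], [0.5, 2.5, 0.2], labels)
```

Here the appearance stream alone favours the first class and the motion stream favours the second; after averaging, the more confident motion stream decides the prediction.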

arXiv.org e-Print Archive

Crossref

Segmental Spatiotemporal CNNs for Fine-grained Action Segmentation

Author: G Navarro
HS Koppula
L Zappella
Lingling Tao
M Rohrbach
Q Shi
T van Kasteren
Publication date: 30/09/2016

Joint segmentation and classification of fine-grained actions is important for applications of human-robot interaction, video surveillance, and human skill evaluation. However, despite substantial recent progress in large-scale action classification, the performance of state-of-the-art fine-grained action recognition approaches remains low. We propose a model for action segmentation which combines low-level spatiotemporal features with a high-level segmental classifier. Our spatiotemporal CNN comprises a spatial component that uses convolutional filters to capture information about objects and their relationships, and a temporal component that uses large 1D convolutional filters to capture information about how object relationships change across time. These features are used in tandem with a semi-Markov model that models transitions from one action to another. We introduce an efficient constrained segmental inference algorithm for this model that is orders of magnitude faster than the current approach. We highlight the effectiveness of our Segmental Spatiotemporal CNN on cooking and surgical action datasets for which we observe substantially improved performance relative to recent baseline methods. Comment: Updated from the ECCV 2016 version. We fixed an important mathematical error and made the section on segmental inference clearer.
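To make the idea of constrained segmental inference more concrete, here is a minimal dynamic-programming sketch that labels every frame while allowing at most a fixed number of segments. It is not taken from the paper: capping the segment count is an assumed form of the constraint, and `segment_actions`, `frame_scores`, `transition_scores` and `max_segments` are hypothetical names.

```python
def segment_actions(frame_scores, transition_scores, max_segments):
    """Best per-frame labels using at most max_segments contiguous segments.

    frame_scores[t][c]: score of class c at frame t (finite numbers).
    transition_scores[a][b]: score for switching from class a to class b.
    """
    num_frames = len(frame_scores)
    num_classes = len(frame_scores[0])
    neg_inf = float("-inf")
    # best[k][t][c]: best score for frames 0..t using exactly k + 1 segments,
    # with frame t labeled c. back[k][t][c] stores the previous (k, c) state.
    best = [[[neg_inf] * num_classes for _ in range(num_frames)]
            for _ in range(max_segments)]
    back = [[[None] * num_classes for _ in range(num_frames)]
            for _ in range(max_segments)]
    for c in range(num_classes):
        best[0][0][c] = frame_scores[0][c]

    for t in range(1, num_frames):
        for k in range(max_segments):
            for c in range(num_classes):
                # Option 1: stay in the current segment.
                cand, arg = best[k][t - 1][c], (k, c)
                # Option 2: start a new segment, coming from another class.
                if k > 0:
                    for prev in range(num_classes):
                        if prev == c or best[k - 1][t - 1][prev] == neg_inf:
                            continue
                        s = best[k - 1][t - 1][prev] + transition_scores[prev][c]
                        if s > cand:
                            cand, arg = s, (k - 1, prev)
                if cand > neg_inf:
                    best[k][t][c] = cand + frame_scores[t][c]
                    back[k][t][c] = arg

    # Pick the best final state, then trace the labels backwards.
    last = num_frames - 1
    k, c = max(((k, c) for k in range(max_segments) for c in range(num_classes)),
               key=lambda kc: best[kc[0]][last][kc[1]])
    labels = [c]
    for t in range(last, 0, -1):
        k, c = back[k][t][c]
        labels.append(c)
    labels.reverse()
    return labels


# Toy input: frame 2 is a noisy frame that slightly prefers class 1.
scores = [[1, 0], [1, 0], [0, 0.2], [1, 0], [0, 1], [0, 1]]
no_penalty = [[0, 0], [0, 0]]
smoothed = segment_actions(scores, no_penalty, max_segments=2)
```

With `max_segments=2` the noisy frame is absorbed into the first segment, whereas a larger limit lets the labeling follow the frame-wise maximum. The cost of this sketch grows with frames × segments × classes², which is the kind of bounded search that a segment-count constraint makes possible.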

arXiv.org e-Print Archive

Crossref

Egocentric action recognition from noisy videos

Author: Yang Lijin
Publication venue: Department of Information and Communication Engineering, Graduate School of Information Science and Technology
Publication date: 23/03/2020

Degree type: Master's. University of Tokyo (東京大学)

3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks

Author: Lin Liang
Wang Keze
Wang Meng
Wang Xiaolong
Zuo Wangmeng
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/02/2015

Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interactions. This work aims to develop a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic-like video segments, and learn to discover the temporal structures for a category of activities, i.e. how the activities are to be decomposed in terms of classification. Our model can be regarded as a structured deep architecture, as it extends convolutional neural networks (CNNs) by incorporating structure alternatives. Specifically, we build the network from 3D convolutions and max-pooling operators over the video segments, and introduce latent variables in each convolutional layer that manipulate the activation of neurons. Our model thus advances existing approaches in two aspects: (i) it acts directly on the raw inputs (grayscale-depth data) to conduct recognition instead of relying on hand-crafted features, and (ii) the model structure can be dynamically adjusted to account for the temporal variations of human activities, i.e. the network configuration is allowed to be partially activated during inference. For model training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters by using the back-propagation algorithm. Our approach is validated in challenging scenarios, and outperforms state-of-the-art methods. A large human activity database of RGB-D videos is presented in addition. Comment: This manuscript has 10 pages with 9 figures, and a preliminary version was published in the ACM MM'14 conference.

arXiv.org e-Print Archive

Crossref