528 research outputs found

    Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks

    Full text link
    Human action recognition in 3D skeleton sequences has attracted a lot of research attention. Recently, Long Short-Term Memory (LSTM) networks have shown promising performance in this task due to their strengths in modeling the dependencies and dynamics in sequential data. As not all skeletal joints are informative for action recognition, and the irrelevant joints often bring noise which can degrade the performance, we need to pay more attention to the informative ones. However, the original LSTM network does not have explicit attention ability. In this paper, we propose a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for skeleton based action recognition. This network is capable of selectively focusing on the informative joints in each frame of each skeleton sequence by using a global context memory cell. To further improve the attention capability of our network, we also introduce a recurrent attention mechanism, with which the attention performance of the network can be enhanced progressively. Moreover, we propose a stepwise training scheme in order to train our network effectively. Our approach achieves state-of-the-art performance on five challenging benchmark datasets for skeleton based action recognition

    Interpretable 3D Human Action Analysis with Temporal Convolutional Networks

    Full text link
    The discriminative power of modern deep learning models for 3D human action recognition is growing ever so potent. In conjunction with the recent resurgence of 3D human action representation with 3D skeletons, the quality and the pace of recent progress have been significant. However, the inner workings of state-of-the-art learning based methods in 3D human action recognition still remain mostly black-box. In this work, we propose to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition. Compared to popular LSTM-based Recurrent Neural Network models, given interpretable input such as 3D skeletons, TCN provides us a way to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition. We provide our strategy in re-designing the TCN with interpretability in mind and how such characteristics of the model is leveraged to construct a powerful 3D activity recognition method. Through this work, we wish to take a step towards a spatio-temporal model that is easier to understand, explain and interpret. The resulting model, Res-TCN, achieves state-of-the-art results on the largest 3D human action recognition dataset, NTU-RGBD.Comment: 8 pages, 5 figures, BNMW CVPR 2017 Submissio

    Concurrence-Aware Long Short-Term Sub-Memories for Person-Person Action Recognition

    Full text link
    Recently, Long Short-Term Memory (LSTM) has become a popular choice to model individual dynamics for single-person action recognition due to its ability of modeling the temporal information in various ranges of dynamic contexts. However, existing RNN models only focus on capturing the temporal dynamics of the person-person interactions by naively combining the activity dynamics of individuals or modeling them as a whole. This neglects the inter-related dynamics of how person-person interactions change over time. To this end, we propose a novel Concurrence-Aware Long Short-Term Sub-Memories (Co-LSTSM) to model the long-term inter-related dynamics between two interacting people on the bounding boxes covering people. Specifically, for each frame, two sub-memory units store individual motion information, while a concurrent LSTM unit selectively integrates and stores inter-related motion information between interacting people from these two sub-memory units via a new co-memory cell. Experimental results on the BIT and UT datasets show the superiority of Co-LSTSM compared with the state-of-the-art methods

    Investigation of Different Skeleton Features for CNN-based 3D Action Recognition

    Full text link
    Deep learning techniques are being used in skeleton based action recognition tasks and outstanding performance has been reported. Compared with RNN based methods which tend to overemphasize temporal information, CNN-based approaches can jointly capture spatio-temporal information from texture color images encoded from skeleton sequences. There are several skeleton-based features that have proven effective in RNN-based and handcrafted-feature-based methods. However, it remains unknown whether they are suitable for CNN-based approaches. This paper proposes to encode five spatial skeleton features into images with different encoding methods. In addition, the performance implication of different joints used for feature extraction is studied. The proposed method achieved state-of-the-art performance on NTU RGB+D dataset for 3D human action analysis. An accuracy of 75.32\% was achieved in Large Scale 3D Human Activity Analysis Challenge in Depth Videos

    Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks

    Full text link
    Recently, skeleton based action recognition gains more popularity due to cost-effective depth sensors coupled with real-time skeleton estimation algorithms. Traditional approaches based on handcrafted features are limited to represent the complexity of motion patterns. Recent methods that use Recurrent Neural Networks (RNN) to handle raw skeletons only focus on the contextual dependency in the temporal domain and neglect the spatial configurations of articulated skeletons. In this paper, we propose a novel two-stream RNN architecture to model both temporal dynamics and spatial configurations for skeleton based action recognition. We explore two different structures for the temporal stream: stacked RNN and hierarchical RNN. Hierarchical RNN is designed according to human body kinematics. We also propose two effective methods to model the spatial structure by converting the spatial graph into a sequence of joints. To improve generalization of our model, we further exploit 3D transformation based data augmentation techniques including rotation and scaling transformation to transform the 3D coordinates of skeletons during training. Experiments on 3D action recognition benchmark datasets show that our method brings a considerable improvement for a variety of actions, i.e., generic actions, interaction activities and gestures.Comment: Accepted to IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 201
    • …
    corecore