198,804 research outputs found
Hierarchical transfer learning for online recognition of compound actions
Recognising human actions in real-time can provide users with a natural user interface (NUI) enabling a range of innovative and immersive applications. A NUI application should not restrict users’ movements; it should allow users to transition between actions in quick succession, which we term as compound actions. However, the majority of action recognition researchers have focused on individual actions, so their approaches are limited to recognising single actions or multiple actions that are temporally separated.
This paper proposes a novel online action recognition method for fast detection of compound actions. A key contribution is our hierarchical body model that can be automatically configured to detect actions based on the low level body parts that are the most discriminative for a particular action. Another key contribution is a transfer learning strategy to allow the tasks of action segmentation and whole body modelling to be performed on a related but simpler dataset, combined with automatic hierarchical body model adaption on a more complex target dataset.
Experimental results on a challenging and realistic dataset show an improvement in action recognition performance of 16% due to the introduction of our hierarchical transfer learning. The proposed algorithm is fast with an average latency of just 2 frames (66ms) and outperforms state of the art action recognition algorithms that are capable of fast online action recognition
Hierarchical long short-term memory for action recognition based on 3D skeleton joints from Kinect sensor
Action recognition has been used in a wide range of applications such as human-computer interaction, intelligent video surveillance systems, video summarization, and robotics. Recognizing action is important for intelligent agents to understand, learn and interact with the environment. The recent technology that allows the acquisition of RGB+D and 3D skeleton data and a deep learning model's development significantly increases the action recognition model's performance. In this research, hierarchical Long Sort-Term Memory is proposed to recognize action based on 3D skeleton joints from Kinect sensor. The model uses the 3D axis of skeleton joints and groups each joint in the axis into parts, namely, spine, left and right arm, left and right hand, and left and right leg. To fit the hierarchically structured layers of LSTM, the parts are concatenated into spine, arms, hands, and legs and then concatenated into the body. The model crosses the body in each axis into a single final body and fed to the final layer to classify the action. The performance is measured using cross-view and cross-subject evaluation and achieves accuracy 0.854 and 0.837, respectively, from the 10 action classes of the NTU RGB+D dataset
Multimodal Multipart Learning for Action Recognition in Depth Videos
The articulated and complex nature of human actions makes the task of action
recognition difficult. One approach to handle this complexity is dividing it to
the kinetics of body parts and analyzing the actions based on these partial
descriptors. We propose a joint sparse regression based learning method which
utilizes the structured sparsity to model each action as a combination of
multimodal features from a sparse set of body parts. To represent dynamics and
appearance of parts, we employ a heterogeneous set of depth and skeleton based
features. The proper structure of multimodal multipart features are formulated
into the learning framework via the proposed hierarchical mixed norm, to
regularize the structured features of each part and to apply sparsity between
them, in favor of a group feature selection. Our experimental results expose
the effectiveness of the proposed learning method in which it outperforms other
methods in all three tested datasets while saturating one of them by achieving
perfect accuracy
Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks
Recently, skeleton based action recognition gains more popularity due to
cost-effective depth sensors coupled with real-time skeleton estimation
algorithms. Traditional approaches based on handcrafted features are limited to
represent the complexity of motion patterns. Recent methods that use Recurrent
Neural Networks (RNN) to handle raw skeletons only focus on the contextual
dependency in the temporal domain and neglect the spatial configurations of
articulated skeletons. In this paper, we propose a novel two-stream RNN
architecture to model both temporal dynamics and spatial configurations for
skeleton based action recognition. We explore two different structures for the
temporal stream: stacked RNN and hierarchical RNN. Hierarchical RNN is designed
according to human body kinematics. We also propose two effective methods to
model the spatial structure by converting the spatial graph into a sequence of
joints. To improve generalization of our model, we further exploit 3D
transformation based data augmentation techniques including rotation and
scaling transformation to transform the 3D coordinates of skeletons during
training. Experiments on 3D action recognition benchmark datasets show that our
method brings a considerable improvement for a variety of actions, i.e.,
generic actions, interaction activities and gestures.Comment: Accepted to IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR) 201
Action Recognition by Hierarchical Mid-level Action Elements
Realistic videos of human actions exhibit rich spatiotemporal structures at
multiple levels of granularity: an action can always be decomposed into
multiple finer-grained elements in both space and time. To capture this
intuition, we propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an action-related spatiotemporal
segment in the video. We introduce an unsupervised method to generate this
representation from videos. Our method is capable of distinguishing
action-related segments from background segments and representing actions at
multiple spatiotemporal resolutions. Given a set of spatiotemporal segments
generated from the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple levels of granularity.
We develop structured models that capture a rich set of spatial, temporal and
hierarchical relations among the segments, where the action label and multiple
levels of MAE labels are jointly inferred. The proposed model achieves
state-of-the-art performance in multiple action recognition benchmarks.
Moreover, we demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed videos and
action parsing
- …