Interpretable 3D Human Action Analysis with Temporal Convolutional Networks
The discriminative power of modern deep learning models for 3D human action
recognition is growing ever more potent. Together with the recent resurgence
of 3D skeletons as an action representation, the quality and pace of progress
have been significant. However, the inner workings of state-of-the-art
learning-based methods for 3D human action recognition still remain largely a
black box. In this work, we propose to use a new class of models
known as Temporal Convolutional Neural Networks (TCN) for 3D human action
recognition. Compared to popular LSTM-based recurrent neural network models,
and given interpretable input such as 3D skeletons, TCNs provide a way to
explicitly learn readily interpretable spatio-temporal representations for 3D
human action recognition. We describe our strategy for re-designing the TCN
with interpretability in mind, and show how these characteristics of the model
are leveraged to construct a powerful 3D activity recognition method. Through this work, we
wish to take a step towards a spatio-temporal model that is easier to
understand, explain and interpret. The resulting model, Res-TCN, achieves
state-of-the-art results on the largest 3D human action recognition dataset,
NTU-RGBD.
Comment: 8 pages, 5 figures, BNMW CVPR 2017 Submission
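The paper defines the exact Res-TCN architecture; purely as orientation, a minimal sketch of a pre-activation residual temporal-convolution block of the kind the abstract describes might look as follows. The kernel size, channel width, and the 25-joint skeleton input shape are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ResTCNBlock(nn.Module):
    """One pre-activation residual temporal-convolution block (sketch)."""

    def __init__(self, channels, kernel_size=9):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels)
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        # Identity skip connection: the output is the input plus a learned
        # temporal filtering, which keeps per-frame contributions traceable.
        return x + self.conv(torch.relu(self.bn(x)))

# A hypothetical 3D-skeleton clip: batch of 4, 25 joints x 3 coords = 75
# channels, 300 frames.
x = torch.randn(4, 75, 300)
print(ResTCNBlock(75)(x).shape)  # torch.Size([4, 75, 300])
```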
A Grid-based Representation for Human Action Recognition
Human action recognition (HAR) in videos is a fundamental research topic in
computer vision. It consists mainly of understanding actions performed by
humans from a sequence of visual observations. In recent years, HAR has
witnessed significant progress, especially with the emergence of deep learning
models. However, most existing approaches for action recognition rely on
information that is not always relevant to the task, and are limited in the
way they fuse temporal information. In this paper, we propose a novel
method for human action recognition that efficiently encodes the most
discriminative appearance information of an action, with explicit attention on
representative pose features, into a new compact grid representation. Our GRAR
(Grid-based Representation for Action Recognition) method is tested on several
benchmark datasets demonstrating that our model can accurately recognize human
actions, despite intra-class appearance variations and occlusion challenges.
Comment: Accepted at the 25th International Conference on Pattern Recognition
(ICPR 2020)
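GRAR's frame selection and pose attention are more involved than any one snippet, but a hedged sketch of the general idea (tiling the most relevant frames of a clip into one compact image for a 2D CNN) could look like this; the 3x3 layout, the generic per-frame relevance score, and the function name are assumptions for illustration only.

```python
import numpy as np
import cv2  # OpenCV, used here only for resizing

def build_action_grid(frames, scores, rows=3, cols=3, cell=112):
    """Tile the highest-scoring frames of a clip into one grid image.

    frames: list of HxWx3 uint8 images
    scores: one relevance score per frame (e.g. from a pose detector);
            a stand-in for GRAR's explicit pose-based attention.
    """
    top = sorted(np.argsort(scores)[-rows * cols:])  # best frames, in temporal order
    grid = np.zeros((rows * cell, cols * cell, 3), dtype=np.uint8)
    for i, f in enumerate(top):
        r, c = divmod(i, cols)
        grid[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = \
            cv2.resize(frames[f], (cell, cell))
    return grid  # one compact image, ready for any 2D CNN classifier
```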
Deep temporal motion descriptor (DTMD) for human action recognition
Spatiotemporal features are of significant importance in human action recognition, as they capture the actor's shape and motion characteristics specific to each action class. This paper presents a new deep spatiotemporal human action representation, the "Deep Temporal Motion Descriptor (DTMD)", which shares the attributes of holistic and deep learned features. To generate the DTMD descriptor, the actor's silhouettes are gathered into single motion templates by applying motion history images. These motion templates capture the spatiotemporal movements of the actor and compactly represent the human action as a single 2D template. Deep convolutional neural networks are then used to compute discriminative deep features from the motion history templates to produce DTMD. Finally, DTMD is used to learn a model that recognises human actions with a softmax classifier. The advantages of DTMD are that (i) it is automatically learned from videos and contains a higher-dimensional discriminative spatiotemporal representation than handcrafted features; (ii) it reduces the computational complexity of human activity recognition, since all the video frames are compactly represented as a single motion template; and (iii) it works effectively for single-view and multi-view action recognition. We conducted experiments on three challenging datasets: MuHAVI-Uncut, IXMAS, and IAVID-1. The experimental findings reveal that DTMD outperforms previous methods and achieves the highest action prediction rate on the MuHAVI-Uncut dataset.
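The motion-history-image step at the heart of DTMD is a classic technique, so a short sketch is possible; the tau and decay values below are illustrative, and the subsequent CNN pass (which produces the actual DTMD descriptor from the template) is omitted.

```python
import numpy as np

def motion_history_image(silhouettes, tau=255.0, decay=8.0):
    """Collapse a sequence of binary silhouettes into one motion template:
    recently moving pixels are bright, older motion fades out."""
    mhi = np.zeros_like(silhouettes[0], dtype=np.float32)
    for sil in silhouettes:
        moving = sil > 0
        mhi[moving] = tau                                     # stamp new motion
        mhi[~moving] = np.maximum(0.0, mhi[~moving] - decay)  # fade old motion
    return mhi.astype(np.uint8)  # one 2D template summarising the whole clip
```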
Contextual Statistics of Space-Time Ordered Features for Human Action Recognition
The bag-of-words approach with local spatio-temporal features has become a popular video representation for action recognition. Recent methods have typically focused on capturing global and local statistics of features. However, existing approaches ignore relations between the features, particularly the space-time arrangement of features, and thus may not be discriminative enough. We therefore propose a novel figure-centric representation which captures both the local density of features and statistics of space-time ordered features. Using two benchmark datasets for human action recognition, we demonstrate that our representation enhances the discriminative power of features and improves action recognition performance, achieving a 96.16% recognition rate on the popular KTH action dataset and 93.33% on the challenging ADL dataset.
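As a rough illustration of what "statistics of space-time ordered features" can mean, the sketch below counts temporally ordered pairs of visual words; the paper's figure-centric density component and its full spatial ordering relations are left out, and all names are hypothetical.

```python
import numpy as np

def ordered_pair_histogram(words, times, vocab_size):
    """Count temporally ordered visual-word pairs (sketch).

    words: visual-word index of each local space-time feature
    times: frame index of each feature
    Entry (a, b) counts how often word a occurs strictly before word b.
    """
    words = np.asarray(words)
    times = np.asarray(times)
    hist = np.zeros((vocab_size, vocab_size))
    for i in range(len(words)):
        later = times > times[i]                    # features strictly after i
        np.add.at(hist[words[i]], words[later], 1)  # accumulate repeated words
    return hist / max(hist.sum(), 1.0)              # normalise to a distribution
```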
Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition
The bag-of-words approach with local spatio-temporal features has become a popular video representation for action recognition in videos. Together these techniques have demonstrated high recognition results for a number of action classes. Recent approaches have typically focused on capturing global statistics of features. However, existing methods ignore relations between features and thus may not be discriminative enough. We therefore propose a novel feature representation which captures statistics of pairwise co-occurring local spatio-temporal features. Our representation captures not only the global distribution of features but also focuses on geometric and appearance (both visual and motion) relations among the features. By calculating a set of bag-of-words representations with different geometrical arrangements among the features, we keep an important association between appearance and geometric information. Using two benchmark datasets for human action recognition, we demonstrate that our representation enhances the discriminative power of features and improves action recognition performance.
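Complementing the ordered-pair sketch above, a pairwise co-occurrence statistic of the kind this abstract describes can be sketched as counting unordered word pairs that fall within a space-time neighbourhood; the radius value and the omission of the paper's geometric-arrangement binning are simplifying assumptions.

```python
import numpy as np

def cooccurrence_bow(words, positions, vocab_size, radius=20.0):
    """Bag of pairwise co-occurring visual words (sketch).

    words: visual-word index of each detected space-time feature
    positions: (n, 3) array of (x, y, t) feature locations
    """
    words = np.asarray(words)
    pos = np.asarray(positions, dtype=np.float32)
    hist = np.zeros((vocab_size, vocab_size))
    for i in range(len(words)):
        d = np.linalg.norm(pos[i + 1:] - pos[i], axis=1)   # space-time distance
        for j in np.nonzero(d < radius)[0] + i + 1:        # nearby features only
            a, b = sorted((words[i], words[j]))
            hist[a, b] += 1                # unordered pair -> upper triangle
    return hist[np.triu_indices(vocab_size)]  # flatten to a feature vector
```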
Latent Semantic Learning with Structured Sparse Representation for Human Action Recognition
This paper proposes a novel latent semantic learning method for extracting
high-level features (i.e. latent semantics) from a large vocabulary of abundant
mid-level features (i.e. visual keywords) with structured sparse
representation, which can help to bridge the semantic gap in the challenging
task of human action recognition. To discover the manifold structure of
mid-level features, we develop a spectral embedding approach to latent semantic
learning based on L1-graph, without the need to tune any parameter for graph
construction as a key step of manifold learning. More importantly, we construct
the L1-graph with structured sparse representation, which can be obtained by
structured sparse coding with its structured sparsity ensured by novel L1-norm
hypergraph regularization over mid-level features. In the new embedding space,
we learn latent semantics automatically from abundant mid-level features
through spectral clustering. The learnt latent semantics can be readily used
for human action recognition with SVM by defining a histogram intersection
kernel. Different from the traditional latent semantic analysis based on topic
models, our latent semantic learning method can explore the manifold structure
of mid-level features in both L1-graph construction and spectral embedding,
which results in compact but discriminative high-level features. The
experimental results on the commonly used KTH action dataset and unconstrained
YouTube action dataset show the superior performance of our method.
Comment: The short version of this paper appears in ICCV 2011
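Structured sparse coding with hypergraph regularisation is beyond a short snippet, but the overall pipeline the abstract describes (sparse reconstruction of each mid-level word from the others, an L1-graph built from the codes, spectral grouping into latent semantics) can be sketched with plain Lasso standing in for the structured sparse coding; all parameter values and names below are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def l1_graph_latent_semantics(X, n_latent, alpha=0.1):
    """Sketch of latent-semantic learning over an L1-graph.

    X: (n_words, d) matrix of mid-level feature (visual-word) signatures.
    Each word is sparsely reconstructed from the others; the code
    magnitudes define graph weights, and spectral clustering groups
    words into n_latent high-level semantics.
    """
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(X[others].T, X[i])        # reconstruct x_i from the rest
        W[i, others] = np.abs(lasso.coef_)  # sparse codes -> edge weights
    W = (W + W.T) / 2                       # symmetrise the affinity
    labels = SpectralClustering(n_clusters=n_latent,
                                affinity='precomputed').fit_predict(W)
    return labels  # latent-semantic index for every mid-level word
```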