Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn
This paper presents an image-classification-based approach to the skeleton-based
video action recognition problem. First, a dataset-independent
translation-scale invariant image mapping method is proposed, which transforms
skeleton videos into colour images, named skeleton-images. Second, a
multi-scale deep convolutional neural network (CNN) architecture is proposed,
which can be built and fine-tuned on powerful pre-trained CNNs, e.g.,
AlexNet, VGGNet, and ResNet. Even though the skeleton-images are very
different from natural images, the fine-tuning strategy still works well.
Finally, we show that our method also works well on 2D skeleton video data.
We achieve state-of-the-art results on the popular benchmark datasets
NTU RGB+D, UTD-MHAD, MSRC-12, and G3D. In particular, on the large and
challenging NTU RGB+D, UTD-MHAD, and MSRC-12 datasets, our method outperforms
other methods by a large margin, which demonstrates the efficacy of the
proposed method.
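The translation-scale invariant mapping can be sketched as follows. This is a minimal sketch under our own assumptions (centroid subtraction for translation invariance, max-magnitude scaling for scale invariance, one RGB pixel per frame-joint pair), not the paper's exact recipe:

```python
import numpy as np

def skeleton_to_image(joints):
    """Map a skeleton video to a pseudo-colour 'skeleton-image'.

    joints: array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    Returns a uint8 array of shape (T, J, 3): each (frame, joint) pair
    becomes one RGB pixel, ready to be resized to a fixed CNN input size.
    """
    joints = np.asarray(joints, dtype=np.float64)
    # Translation invariance: subtract the per-sequence centroid.
    centred = joints - joints.mean(axis=(0, 1), keepdims=True)
    # Scale invariance: normalise by the largest coordinate magnitude,
    # so the mapping is independent of subject size and camera distance.
    scale = np.abs(centred).max()
    if scale > 0:
        centred = centred / scale
    # Map [-1, 1] -> [0, 255] channel values.
    return ((centred + 1.0) * 127.5).astype(np.uint8)
```

Because the centroid and global scale are removed, translating or uniformly rescaling the input skeleton leaves the resulting image (essentially) unchanged, which is what makes the mapping dataset-independent.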
Local Spherical Harmonics Improve Skeleton-Based Hand Action Recognition
Hand action recognition is essential: communication, human-robot
interaction, and gesture control all depend on it. Skeleton-based action
recognition traditionally includes hands, which belong to the classes that
remain challenging to recognize correctly to date. We propose a method
specifically designed for hand action recognition that uses relative angular
embeddings and local Spherical Harmonics to create novel hand representations.
The use of Spherical Harmonics creates rotation-invariant representations,
making hand action recognition more robust against inter-subject differences
and viewpoint changes. We conduct extensive experiments on the hand joints in
the First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose
Annotations, and on the NTU RGB+D 120 dataset, demonstrating the benefit of
using Local Spherical Harmonics Representations. Our code is available at
https://github.com/KathPra/LSHR_LSHT
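The rotation-invariance property that motivates Spherical Harmonics can be illustrated with the per-degree power spectrum of a set of joint directions. This sketch is not the paper's LSHR/LSHT representation; it uses the spherical-harmonic addition theorem, which lets the power sum_m |c_lm|^2 of coefficients c_lm = sum_i Y_lm(u_i) be computed from pairwise dot products alone, so the result is manifestly rotation-invariant:

```python
import numpy as np

def sh_power_spectrum(directions, l_max=4):
    """Rotation-invariant spherical-harmonic power spectrum.

    directions: (N, 3) vectors, e.g. relative bone directions of a hand.
    By the addition theorem, sum_m Y_lm(u) conj(Y_lm(v)) is proportional
    to the Legendre polynomial P_l(u . v), so the per-degree power only
    depends on pairwise angles between directions, never on orientation.
    """
    u = np.asarray(directions, dtype=np.float64)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)  # unit vectors
    dots = np.clip(u @ u.T, -1.0, 1.0)                # pairwise cosines
    spectrum = []
    for l in range(l_max + 1):
        coeffs = np.zeros(l + 1)
        coeffs[l] = 1.0                               # select P_l
        p_l = np.polynomial.legendre.legval(dots, coeffs).sum()
        spectrum.append((2 * l + 1) / (4 * np.pi) * p_l)
    return np.array(spectrum)
```

Rotating every direction by the same orthogonal matrix leaves all pairwise dot products, and hence the spectrum, unchanged, which is the robustness to viewpoint changes the abstract refers to.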
Transforming spatio-temporal self-attention using action embedding for skeleton-based action recognition
Over the past few years, skeleton-based action recognition has attracted considerable attention because skeleton data is immune to illumination variation, viewpoint variation, background clutter, scaling, and camera motion. However, effective modeling of the latent information in skeleton data is still a challenging problem. Therefore, in this paper, we propose a novel idea of action embedding with a self-attention Transformer network for skeleton-based action recognition. Our proposed technique mainly comprises two modules: i) action embedding and ii) a self-attention Transformer. The action embedding encodes the relationship between corresponding body joints (e.g., joints of both hands move together when performing a clapping action) and thus captures the spatial features of joints. Meanwhile, temporal features and dependencies of body joints are modeled using the Transformer architecture. Our method works in a single-stream (end-to-end) fashion, where an MLP is used for classification. We carry out an ablation study and evaluate the performance of our model on the small-scale SYSU-3D dataset and the large-scale NTU-RGB+D and NTU-RGB+D 120 datasets, where the results establish that our method performs better than other state-of-the-art architectures.
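The temporal self-attention at the heart of such a Transformer can be sketched in a few lines. This is a generic single-head scaled dot-product attention in NumPy, not the paper's architecture; the projection matrices and their shapes are illustrative:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x: (T, d) per-frame skeleton embeddings.
    w_q, w_k, w_v: (d, d_k) query/key/value projections (illustrative).
    Returns (T, d_k) features where every frame attends to every frame.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ v                              # attended temporal features
```

In a full model, the output would pass through feed-forward layers and finally the MLP classifier the abstract mentions; the key point is that long-range temporal dependencies between joints are captured in a single matrix product over all frame pairs.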
EleAtt-RNN: Adding Attentiveness to Neurons in Recurrent Neural Networks
Recurrent neural networks (RNNs) are capable of modeling temporal
dependencies of complex sequential data. In general, current available
structures of RNNs tend to concentrate on controlling the contributions of
current and previous information. However, the exploration of different
importance levels of different elements within an input vector is always
ignored. We propose a simple yet effective Element-wise-Attention Gate
(EleAttG), which can be easily added to an RNN block (e.g. all RNN neurons in
an RNN layer), to empower the RNN neurons to have attentiveness capability. For
an RNN block, an EleAttG is used for adaptively modulating the input by
assigning different levels of importance, i.e., attention, to each
element/dimension of the input. We refer to an RNN block equipped with an
EleAttG as an EleAtt-RNN block. Instead of modulating the input as a whole, the
EleAttG modulates the input at fine granularity, i.e., element-wise, and the
modulation is content adaptive. The proposed EleAttG, as an additional
fundamental unit, is general and can be applied to any RNN structures, e.g.,
standard RNN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU). We
demonstrate the effectiveness of the proposed EleAtt-RNN by applying it to
different tasks including the action recognition, from both skeleton-based data
and RGB videos, gesture recognition, and sequential MNIST classification.
Experiments show that adding attentiveness through EleAttGs to RNN blocks
significantly improves the power of RNNs.
Comment: Accepted by IEEE Transactions on Image Processing. arXiv admin note:
substantial text overlap with arXiv:1807.0444
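The gating idea can be sketched as follows. This is a simplified reading in which the attention response depends only on the current input (the paper's gate may also condition on the recurrent state), and the weight names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eleattg_step(x_t, w_a, b_a):
    """Element-wise-Attention Gate (EleAttG), simplified sketch.

    x_t: (d,) input vector at one time step.
    w_a, b_a: (d, d) and (d,) gate parameters (illustrative names).
    a_t assigns an importance in (0, 1) to each element of the input;
    the modulated input a_t * x_t then replaces x_t in the standard
    RNN/LSTM/GRU update, giving fine-grained, content-adaptive attention.
    """
    a_t = sigmoid(w_a @ x_t + b_a)   # per-element attention response
    return a_t * x_t                 # element-wise modulated input
```

Because the gate acts per dimension rather than on the input as a whole, the recurrent cell can downweight uninformative coordinates (e.g. noisy joints in a skeleton frame) at every step.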
Model-Based High-Dimensional Pose Estimation with Application to Hand Tracking
This thesis presents novel techniques for computer-vision-based full-DOF human hand motion estimation. Our main contributions are: a robust skin color estimation approach; a novel resolution-independent and memory-efficient representation of hand pose silhouettes, which allows us to compute area-based similarity measures in near-constant time; a set of new segmentation-based similarity measures; a new class of similarity measures that work for nearly arbitrary input modalities; a novel edge-based similarity measure that avoids any problematic thresholding or discretization and can be computed very efficiently in Fourier space; a template hierarchy that minimizes the number of similarity computations needed to find the most likely observed hand pose; and finally, a novel image-space search method, which we naturally combine with our hierarchy. Consequently, matching can be efficiently formulated as a simultaneous template tree traversal and function maximization.
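The efficiency of Fourier-space similarity computation rests on the correlation theorem: correlating a template against an image at every offset costs O(N log N) via the FFT instead of O(N^2) directly. This sketch shows the generic mechanism, not the thesis's specific edge-based measure:

```python
import numpy as np

def fft_cross_correlation(template, image):
    """Dense cross-correlation of a template with an image via the FFT.

    Correlation theorem: corr = IFFT(F(image) * conj(F(template))),
    with the template zero-padded to the image size. The result is a
    map whose value at (dy, dx) scores placing the template there
    (circularly, since the FFT is periodic).
    """
    f_img = np.fft.fft2(image)
    f_tpl = np.fft.fft2(template, s=image.shape)  # zero-pad to image size
    return np.fft.ifft2(f_img * np.conj(f_tpl)).real

# The peak of the map gives the best template placement:
# dy, dx = np.unravel_index(corr.argmax(), corr.shape)
```

For edge maps this yields a full similarity surface in one pass, which is what makes evaluating many candidate poses against a template hierarchy tractable.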