9,989 research outputs found
Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation
Recently, mid-level features have shown promising performance in computer
vision. Mid-level features learned by incorporating class-level information are
potentially more discriminative than traditional low-level local features. In
this paper, an effective method is proposed to extract mid-level features from
Kinect skeletons for 3D human action recognition. Firstly, the orientations of
limbs connected by two skeleton joints are computed and each orientation is
encoded into one of the 27 states indicating the spatial relationship of the
joints. Secondly, limbs are combined into parts and the limb's states are
mapped into part states. Finally, frequent pattern mining is employed to mine
the most frequent and relevant (discriminative, representative and
non-redundant) states of parts in continuous several frames. These parts are
referred to as Frequent Local Parts or FLPs. The FLPs allow us to build
powerful bag-of-FLP-based action representation. This new representation yields
state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D
Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks
Human action recognition in 3D skeleton sequences has attracted a lot of
research attention. Recently, Long Short-Term Memory (LSTM) networks have shown
promising performance in this task due to their strengths in modeling the
dependencies and dynamics in sequential data. As not all skeletal joints are
informative for action recognition, and the irrelevant joints often bring noise
which can degrade the performance, we need to pay more attention to the
informative ones. However, the original LSTM network does not have explicit
attention ability. In this paper, we propose a new class of LSTM network,
Global Context-Aware Attention LSTM (GCA-LSTM), for skeleton based action
recognition. This network is capable of selectively focusing on the informative
joints in each frame of each skeleton sequence by using a global context memory
cell. To further improve the attention capability of our network, we also
introduce a recurrent attention mechanism, with which the attention performance
of the network can be enhanced progressively. Moreover, we propose a stepwise
training scheme in order to train our network effectively. Our approach
achieves state-of-the-art performance on five challenging benchmark datasets
for skeleton based action recognition
Simultaneous Feature and Body-Part Learning for Real-Time Robot Awareness of Human Behaviors
Robot awareness of human actions is an essential research problem in robotics
with many important real-world applications, including human-robot
collaboration and teaming. Over the past few years, depth sensors have become a
standard device widely used by intelligent robots for 3D perception, which can
also offer human skeletal data in 3D space. Several methods based on skeletal
data were designed to enable robot awareness of human actions with satisfactory
accuracy. However, previous methods treated all body parts and features equally
important, without the capability to identify discriminative body parts and
features. In this paper, we propose a novel simultaneous Feature And Body-part
Learning (FABL) approach that simultaneously identifies discriminative body
parts and features, and efficiently integrates all available information
together to enable real-time robot awareness of human behaviors. We formulate
FABL as a regression-like optimization problem with structured
sparsity-inducing norms to model interrelationships of body parts and features.
We also develop an optimization algorithm to solve the formulated problem,
which possesses a theoretical guarantee to find the optimal solution. To
evaluate FABL, three experiments were performed using public benchmark
datasets, including the MSR Action3D and CAD-60 datasets, as well as a Baxter
robot in practical assistive living applications. Experimental results show
that our FABL approach obtains a high recognition accuracy with a processing
speed of the order-of-magnitude of 10e4 Hz, which makes FABL a promising method
to enable real-time robot awareness of human behaviors in practical robotics
applications.Comment: 8 pages, 6 figures, accepted by ICRA'1
Investigation of Different Skeleton Features for CNN-based 3D Action Recognition
Deep learning techniques are being used in skeleton based action recognition
tasks and outstanding performance has been reported. Compared with RNN based
methods which tend to overemphasize temporal information, CNN-based approaches
can jointly capture spatio-temporal information from texture color images
encoded from skeleton sequences. There are several skeleton-based features that
have proven effective in RNN-based and handcrafted-feature-based methods.
However, it remains unknown whether they are suitable for CNN-based approaches.
This paper proposes to encode five spatial skeleton features into images with
different encoding methods. In addition, the performance implication of
different joints used for feature extraction is studied. The proposed method
achieved state-of-the-art performance on NTU RGB+D dataset for 3D human action
analysis. An accuracy of 75.32\% was achieved in Large Scale 3D Human Activity
Analysis Challenge in Depth Videos
Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks
This paper proposes three simple, compact yet effective representations of
depth sequences, referred to respectively as Dynamic Depth Images (DDI),
Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images
(DDMNI). These dynamic images are constructed from a sequence of depth maps
using bidirectional rank pooling to effectively capture the spatial-temporal
information. Such image-based representations enable us to fine-tune the
existing ConvNets models trained on image data for classification of depth
sequences, without introducing large parameters to learn. Upon the proposed
representations, a convolutional Neural networks (ConvNets) based method is
developed for gesture recognition and evaluated on the Large-scale Isolated
Gesture Recognition at the ChaLearn Looking at People (LAP) challenge 2016. The
method achieved 55.57\% classification accuracy and ranked place in
this challenge but was very close to the best performance even though we only
used depth data.Comment: arXiv admin note: text overlap with arXiv:1608.0633
Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks
Recently, Convolutional Neural Networks (ConvNets) have shown promising
performances in many computer vision tasks, especially image-based recognition.
How to effectively use ConvNets for video-based recognition is still an open
problem. In this paper, we propose a compact, effective yet simple method to
encode spatio-temporal information carried in skeleton sequences into
multiple images, referred to as Joint Trajectory Maps (JTM), and ConvNets
are adopted to exploit the discriminative features for real-time human action
recognition. The proposed method has been evaluated on three public benchmarks,
i.e., MSRC-12 Kinect gesture dataset (MSRC-12), G3D dataset and UTD multimodal
human action dataset (UTD-MHAD) and achieved the state-of-the-art results
Robust 3D Action Recognition through Sampling Local Appearances and Global Distributions
3D action recognition has broad applications in human-computer interaction
and intelligent surveillance. However, recognizing similar actions remains
challenging since previous literature fails to capture motion and shape cues
effectively from noisy depth data. In this paper, we propose a novel two-layer
Bag-of-Visual-Words (BoVW) model, which suppresses the noise disturbances and
jointly encodes both motion and shape cues. First, background clutter is
removed by a background modeling method that is designed for depth data. Then,
motion and shape cues are jointly used to generate robust and distinctive
spatial-temporal interest points (STIPs): motion-based STIPs and shape-based
STIPs. In the first layer of our model, a multi-scale 3D local steering kernel
(M3DLSK) descriptor is proposed to describe local appearances of cuboids around
motion-based STIPs. In the second layer, a spatial-temporal vector (STV)
descriptor is proposed to describe the spatial-temporal distributions of
shape-based STIPs. Using the Bag-of-Visual-Words (BoVW) model, motion and shape
cues are combined to form a fused action representation. Our model performs
favorably compared with common STIP detection and description methods. Thorough
experiments verify that our model is effective in distinguishing similar
actions and robust to background clutter, partial occlusions and pepper noise
- …