Deep representation learning for human motion prediction and classification
Generative models of 3D human motion are often restricted to a small number
of activities and therefore cannot generalize well to novel movements or
applications. In this work we propose a deep learning framework for human
motion capture data that learns a generic representation from a large corpus of
motion capture data and generalizes well to new, unseen, motions. Using an
encoding-decoding network that learns to predict future 3D poses from the most
recent past, we extract a feature representation of human motion. Most work on
deep learning for sequence prediction focuses on video and speech. Since
skeletal data has a different structure, we present and evaluate different
network architectures that make different assumptions about time dependencies
and limb correlations. To quantify the learned features, we use the output of
different layers for action classification and visualize the receptive fields
of the network units. Our method outperforms the recent state of the art in
skeletal motion prediction even though those methods use action-specific training data.
Our results show that deep feedforward networks, trained from a generic mocap
database, can successfully be used for feature extraction from human motion
data and that this representation can be used as a foundation for
classification and prediction.
Comment: This paper is published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201
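The encoding-decoding idea described above can be sketched as follows. This is a minimal NumPy toy, not the paper's actual network; the joint count, window lengths, feature size, and random weights are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_joints, past, future = 20, 10, 5     # illustrative window sizes
pose_dim = n_joints * 3                # each pose = 3D joint positions
feat_dim = 64                          # size of the learned representation

def encode(past_poses, W_enc):
    """Map a window of past poses to a feature vector (the representation)."""
    return np.tanh(W_enc @ past_poses.reshape(-1))

def decode(feature, W_dec):
    """Map the feature vector to a predicted window of future poses."""
    return (W_dec @ feature).reshape(future, pose_dim)

# Untrained stand-in weights; in the paper these are learned from mocap data.
W_enc = rng.normal(scale=0.01, size=(feat_dim, past * pose_dim))
W_dec = rng.normal(scale=0.01, size=(future * pose_dim, feat_dim))

window = rng.normal(size=(past, pose_dim))   # most recent past poses
feature = encode(window, W_enc)              # representation for classification
prediction = decode(feature, W_dec)          # predicted future poses
```

After training on the prediction task, the intermediate `feature` vector is what would be reused for downstream action classification.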
Learning Bodily and Temporal Attention in Protective Movement Behavior Detection
For people with chronic pain, the assessment of protective behavior during
physical functioning is essential to understand their subjective pain-related
experiences (e.g., fear and anxiety toward pain and injury) and how they deal
with such experiences (avoidance or reliance on specific body joints), with the
ultimate goal of guiding intervention. Advances in deep learning (DL) can
enable the development of such intervention. Using the EmoPain MoCap dataset,
we investigate how attention-based DL architectures can be used to improve the
detection of protective behavior by capturing the most informative temporal and
body configurational cues characterizing specific movements and the strategies
used to perform them. We propose an end-to-end deep learning architecture named
BodyAttentionNet (BANet). BANet is designed to learn temporal and bodily parts
that are more informative to the detection of protective behavior. The approach
addresses the variety of ways people execute a movement (including healthy
people) independently of the type of movement analyzed. Through extensive
comparison experiments with other state-of-the-art machine learning techniques
used with motion capture data, we show statistically significant improvements
achieved by using these attention mechanisms. In addition, the BANet
architecture requires far fewer parameters than the state of the art for
comparable, if not better, performance.
Comment: 7 pages, 3 figures, 2 tables, code available, accepted in ACII 201
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
We study the problem of human action recognition using motion capture (MoCap)
sequences. Unlike existing techniques that take multiple manual steps to derive
standardized skeleton representations as model input, we propose a novel
Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The model uses a hierarchical transformer with intra-frame off-set attention
and inter-frame self-attention. The attention mechanism allows the model to
freely attend between any two vertex patches to learn non-local relationships
in the spatial-temporal domain. Masked vertex modeling and future frame
prediction are used as two self-supervised tasks to fully activate the
bi-directional and auto-regressive attention in our hierarchical transformer.
The proposed method achieves state-of-the-art performance compared to
skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is
available at https://github.com/zgzxy001/STMT.
Comment: CVPR 202
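The masked vertex modeling objective can be illustrated with a small NumPy sketch. This is only the data-masking setup, not the STMT transformer; the frame count, patch count, token size, and mask ratio are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
frames, patches, dim = 4, 16, 32             # toy mesh-token grid (assumed)
mesh_tokens = rng.normal(size=(frames, patches, dim))

mask_ratio = 0.25                            # fraction of patches to hide
n_masked = int(patches * mask_ratio)
mask_token = np.zeros(dim)                   # placeholder for hidden patches

# Hide the same randomly chosen vertex patches in every frame.
masked = mesh_tokens.copy()
masked_idx = rng.choice(patches, size=n_masked, replace=False)
masked[:, masked_idx] = mask_token

# The self-supervised target: reconstruct the original tokens at the
# masked positions from the visible spatio-temporal context.
target = mesh_tokens[:, masked_idx]
```

Training a bidirectional transformer to recover `target` from `masked` is what activates non-local spatial attention; the companion future-frame-prediction task plays the analogous role for autoregressive temporal attention.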
HP-GAN: Probabilistic 3D human motion prediction via GAN
Predicting and understanding human motion dynamics has many applications,
such as motion synthesis, augmented reality, security, and autonomous vehicles.
Due to the recent success of generative adversarial networks (GAN), there has
been much interest in probabilistic estimation and synthetic data generation
using deep neural network architectures and learning algorithms.
We propose a novel sequence-to-sequence model for probabilistic human motion
prediction, trained with a modified version of improved Wasserstein generative
adversarial networks (WGAN-GP), in which we use a custom loss function designed
for human motion prediction. Our model, which we call HP-GAN, learns a
probability density function of future human poses conditioned on previous
poses. It predicts multiple sequences of possible future human poses, each from
the same input sequence but a different vector z drawn from a random
distribution. Furthermore, to quantify the quality of the non-deterministic
predictions, we simultaneously train a motion-quality-assessment model that
learns the probability that a given skeleton sequence is a real human motion.
We test our algorithm on two of the largest skeleton datasets: NTURGB-D and
Human3.6M. We train our model on both single and multiple action types. Its
predictive power for long-term motion estimation is demonstrated by generating
multiple plausible futures of more than 30 frames from just 10 frames of input.
We show that most sequences generated from the same input have a greater than
50% probability of being judged as a real human sequence. We will release all
the code used in this paper on GitHub.
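The core probabilistic idea, one past sequence plus different noise vectors z yielding multiple plausible futures, can be sketched as a toy conditional generator in NumPy. This is not the HP-GAN model; the dimensions, the single-layer generator, and the random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
pose_dim, z_dim, past_len, future_len = 60, 16, 10, 30  # assumed sizes

def generator(past, z, W_p, W_z, W_out):
    """Toy conditional generator: future poses from past poses and noise z."""
    h = np.tanh(W_p @ past.reshape(-1) + W_z @ z)
    return (W_out @ h).reshape(future_len, pose_dim)

hidden = 128
W_p = rng.normal(scale=0.01, size=(hidden, past_len * pose_dim))
W_z = rng.normal(scale=0.1, size=(hidden, z_dim))
W_out = rng.normal(scale=0.01, size=(future_len * pose_dim, hidden))

past = rng.normal(size=(past_len, pose_dim))  # 10 observed frames
futures = [generator(past, rng.normal(size=z_dim), W_p, W_z, W_out)
           for _ in range(5)]                 # 5 distinct sampled futures
```

In the full method, a WGAN-GP critic and a motion-quality-assessment model score these samples; here the point is only that varying z while fixing the past produces different 30-frame continuations.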
A Locality-based Neural Solver for Optical Motion Capture
We present a novel locality-based learning method for cleaning and solving
optical motion capture data. Given noisy marker data, we propose a new
heterogeneous graph neural network which treats markers and joints as different
types of nodes, and uses graph convolution operations to extract the local
features of markers and joints and transform them into clean motions. To deal
with anomalous markers (e.g., occluded markers or markers with large tracking
errors), the key insight is that a marker's motion correlates strongly with the
motions of its immediate neighboring markers but much less so with distant
markers (i.e., locality), which enables us to efficiently fill in missing
markers (e.g., due to occlusion). Additionally, we identify marker outliers
caused by tracking errors by investigating their acceleration profiles.
Moreover, we propose a training regime based on representation learning and
data augmentation, training the model on masked data. The masking schemes mimic
the occluded and noisy markers often observed in real data. Finally, we show
that our method achieves high accuracy on multiple metrics across various
datasets. Extensive comparisons show that our method outperforms
state-of-the-art methods, reducing the position error of occluded markers by
approximately 20%, which in turn reduces errors in the reconstructed joint
rotations and positions by 30%. The code and data for this paper are available
at https://github.com/non-void/LocalMoCap.
Comment: SIGGRAPH Asia 2023 Conference Paper
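The acceleration-profile idea for flagging tracking-error outliers can be sketched directly in NumPy: a marker that jumps due to a tracking error produces a spike in the second difference of its trajectory. The threshold and trajectory below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def acceleration_outliers(marker_positions, threshold=0.05):
    """Flag frames whose second difference (acceleration) is implausibly large.

    marker_positions: (frames, 3) trajectory of one marker.
    threshold: illustrative cutoff on per-frame acceleration magnitude.
    """
    acc = np.diff(marker_positions, n=2, axis=0)   # (frames - 2, 3)
    mag = np.linalg.norm(acc, axis=-1)
    flags = np.zeros(len(marker_positions), dtype=bool)
    flags[1:-1] = mag > threshold                  # align to the center frame
    return flags

# Smooth circular trajectory with one injected tracking spike at frame 50.
t = np.linspace(0.0, 1.0, 100)
traj = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t], axis=1)
traj[50] += np.array([0.3, 0.0, 0.0])   # simulated tracking error

flags = acceleration_outliers(traj)
```

A single displaced frame contaminates three consecutive second differences, so the spike is flagged at the error frame and its two neighbors, while the smooth motion stays well under the threshold.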
Forecasting Human Dynamics from Static Images
This paper presents the first study on forecasting human dynamics from static
images. The problem is to input a single RGB image and generate a sequence of
upcoming human body poses in 3D. To address the problem, we propose the 3D Pose
Forecasting Network (3D-PFNet). Our 3D-PFNet integrates recent advances on
single-image human pose estimation and sequence prediction, and converts the 2D
predictions into 3D space. We train our 3D-PFNet using a three-step training
strategy to leverage a diverse source of training data, including image and
video based human pose datasets and 3D motion capture (MoCap) data. We
demonstrate competitive performance of our 3D-PFNet on 2D pose forecasting and
3D pose recovery through quantitative and qualitative results.
Comment: Accepted in CVPR 201
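The overall pipeline, estimate a 2D pose from the image, forecast a sequence of 2D poses, then convert to 3D, can be sketched with toy stand-ins. This is not the 3D-PFNet architecture; the joint count, the linear dynamics, and the lifting weights are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_joints, seq_len = 17, 8            # assumed skeleton and horizon

# Stage 1 (stand-in): a 2D pose "estimated" from the single input image.
pose_2d = rng.normal(size=(n_joints, 2))

# Stage 2 (stand-in): forecast future 2D poses with toy linear dynamics.
A = np.eye(n_joints * 2) + 0.01 * rng.normal(size=(n_joints * 2, n_joints * 2))
seq_2d = [pose_2d.reshape(-1)]
for _ in range(seq_len - 1):
    seq_2d.append(A @ seq_2d[-1])
seq_2d = np.stack(seq_2d).reshape(seq_len, n_joints, 2)

# Stage 3 (stand-in): lift each forecast 2D pose into 3D with a linear map.
W_lift = rng.normal(scale=0.1, size=(n_joints * 3, n_joints * 2))
seq_3d = (seq_2d.reshape(seq_len, -1) @ W_lift.T).reshape(seq_len, n_joints, 3)
```

The three stages mirror the three-step training strategy: each stage can be supervised with a different data source (image pose datasets, video pose datasets, and MoCap data, respectively).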
Recent advances in video-based human action recognition using deep learning: A review
© 2017 IEEE. Video-based human action recognition has become one of the most popular research areas in the field of computer vision and pattern recognition in recent years. It has a wide variety of applications, such as surveillance, robotics, health care, video searching, and human-computer interaction. There are many challenges involved in human action recognition in videos, such as cluttered backgrounds, occlusions, viewpoint variation, execution rate, and camera motion. A large number of techniques have been proposed to address these challenges over the decades. Three different types of datasets, namely single-viewpoint, multiple-viewpoint, and RGB-depth videos, are used for research. This paper presents a review of various state-of-the-art deep learning-based techniques proposed for human action recognition on the three types of datasets. In light of the growing popularity and the recent developments in video-based human action recognition, this review details current trends and potential directions for future work to assist researchers.
Automatic Body Pose Estimation with CNNs and LSTMs
In this thesis, we present an end-to-end approach to the human pose estimation task, based on a deep hybrid architecture that combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
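The hybrid pattern, per-frame convolutional features fed through a recurrent network, can be sketched with a toy Elman-style recurrence in NumPy. This is a generic CNN-to-RNN sketch, not the thesis architecture; the feature size, hidden size, and weights are illustrative assumptions:

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One Elman-style recurrent update over per-frame features."""
    return np.tanh(W_h @ h + W_x @ x)

rng = np.random.default_rng(5)
feat_dim, hidden = 32, 16
# Stand-in for CNN features: one feature vector per video frame.
frames = rng.normal(size=(10, feat_dim))

W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, feat_dim))

h = np.zeros(hidden)
for x in frames:
    h = rnn_step(h, x, W_h, W_x)   # h accumulates temporal context
```

The final hidden state `h` summarizes the frame sequence and would feed a pose-regression head; an LSTM replaces `rnn_step` with a gated update but follows the same per-frame loop.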