Differential Recurrent Neural Networks for Human Activity Recognition
Human activity recognition has been an active research area in recent years. The difficulty of this problem lies in the complex dynamical motion patterns embedded through the sequential frames. The Long Short-Term Memory (LSTM) recurrent neural network is capable of processing complex sequential information since it utilizes special gating schemes for learning representations from long input sequences. It has the potential to model various time-series data, where the current hidden state has to be considered in the context of the past hidden states. Unfortunately, the conventional LSTMs do not consider the impact of spatio-temporal dynamics corresponding to the given salient motion patterns, when they gate the information that ought to be memorized through time. To address this problem, we propose a differential gating scheme for the LSTM neural network, which emphasizes the change in information gain caused by the salient motions between the successive video frames. This change in information gain is quantified by Derivative of States (DoS), and thus the proposed LSTM model is termed differential Recurrent Neural Network (dRNN). Based on the energy profiling of DoS, we further propose to employ the State Energy Profile (SEP) to search for salient dRNN states and construct more informative representations. To better understand the scene and human appearance information, the dRNN model is extended by connecting Convolutional Neural Networks (CNN) and stacked dRNNs into an end-to-end model. Lastly, the dissertation continues to discuss and compare the combined and the individual orders of DoS used within the dRNN. We propose to control the LSTM gates via individual order of DoS and stack multiple levels of LSTM cells in increasing orders of state derivatives. To this end, we have introduced a new family of LSTMs, expanding the applications of LSTMs and advancing the performances of the state-of-the-art methods
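The abstract above does not spell out how the Derivative of States or the State Energy Profile is computed. As a rough illustration only, the sketch below assumes that the DoS is a finite difference between the internal states of successive frames and that the "energy" of a state is the L2 norm of its DoS. The function names and the toy state values are hypothetical and are not taken from the dissertation.

```python
import math


def derivative_of_states(states, order=1):
    """Return the order-n discrete difference of a sequence of state vectors.

    Assumption: the DoS is approximated by finite differences between
    successive frames; each pass shortens the sequence by one element.
    """
    diffs = [list(s) for s in states]
    for _ in range(order):
        diffs = [[b - a for a, b in zip(prev, cur)]
                 for prev, cur in zip(diffs, diffs[1:])]
    return diffs


def state_energy(dos):
    """L2 norm of each DoS vector; larger values mean faster state change."""
    return [math.sqrt(sum(x * x for x in d)) for d in dos]


# Toy internal states for five frames (hypothetical values).
states = [[0.0, 0.0], [0.1, 0.0], [0.9, 0.4], [1.0, 0.4], [1.0, 0.5]]

first = derivative_of_states(states, order=1)   # four difference vectors
second = derivative_of_states(states, order=2)  # three difference vectors
energy = state_energy(first)

# Index of the transition with the largest state change (frame 1 to frame 2 here).
salient = max(range(len(energy)), key=energy.__getitem__)
```

Under these assumptions, a search over the energy values picks out the transitions where the state changes most, which is one plausible reading of "searching for salient states"; the dissertation's actual procedure may differ.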
A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community
In recent years, deep learning (DL), a re-branding of neural networks (NNs),
has risen to the top in numerous areas, namely computer vision (CV), speech
recognition, natural language processing, etc. Whereas remote sensing (RS)
possesses a number of unique challenges, primarily related to sensors and
applications, inevitably RS draws from many of the same theories as CV; e.g.,
statistics, fusion, and machine learning, to name a few. This means that the RS
community should be aware of, if not at the leading edge of, advancements
like DL. Herein, we provide the most comprehensive survey of state-of-the-art
RS DL research. We also review recent new developments in the DL field that can
be used in DL for RS. Namely, we focus on theories, tools and challenges for
the RS community. Specifically, we focus on unsolved challenges and
opportunities as they relate to (i) inadequate data sets, (ii)
human-understandable solutions for modelling physical phenomena, (iii) Big
Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and
learning algorithms for spectral, spatial and temporal data, (vi) transfer
learning, (vii) an improved theoretical understanding of DL systems, (viii)
high barriers to entry, and (ix) training and optimizing the DL.
Comment: 64 pages, 411 references. To appear in Journal of Applied Remote
Sensin
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis evolved from earlier schemes that are often limited to controlled
environments to nowadays advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are achieved more rapidly, eventually leading
to the demise of what used to be good in a short time. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable fallbacks, in the
hope of raising fresh questions and motivating new research directions for the
reader
Triplet Attention Transformer for Spatiotemporal Predictive Learning
Spatiotemporal predictive learning offers a self-supervised learning paradigm
that enables models to learn both spatial and temporal patterns by predicting
future sequences based on historical sequences. Mainstream methods are
dominated by recurrent units, yet they are limited by their lack of
parallelization and often underperform in real-world scenarios. To improve
prediction quality while maintaining computational efficiency, we propose an
innovative triplet attention transformer designed to capture both inter-frame
dynamics and intra-frame static features. Specifically, the model incorporates
the Triplet Attention Module (TAM), which replaces traditional recurrent units
by exploring self-attention mechanisms in temporal, spatial, and channel
dimensions. In this configuration: (i) temporal tokens contain abstract
inter-frame representations, facilitating the capture of inherent temporal
dependencies; (ii) spatial and channel attention combine to refine the
intra-frame representation by performing fine-grained interactions across
spatial and channel dimensions. Alternating temporal, spatial, and
channel-level attention allows our approach to learn more complex short- and
long-range spatiotemporal dependencies. Extensive experiments demonstrate
performance surpassing existing recurrent-based and recurrent-free methods,
achieving state-of-the-art results across multiple scenarios, including moving
object trajectory prediction, traffic flow prediction, driving scene
prediction, and human motion capture.
Comment: Accepted to WACV 202
Human Trajectory Prediction via Neural Social Physics
Trajectory prediction has been widely pursued in many fields, and many
model-based and model-free methods have been explored. The former include
rule-based, geometric or optimization-based models, and the latter consist
mainly of deep learning approaches. In this paper, we propose a new method
combining both methodologies based on a new Neural Differential Equation model.
Our new model (Neural Social Physics or NSP) is a deep neural network within
which we use an explicit physics model with learnable parameters. The explicit
physics model serves as a strong inductive bias in modeling pedestrian
behaviors, while the rest of the network provides a strong data-fitting
capability in terms of system parameter estimation and dynamics stochasticity
modeling. We compare NSP with 15 recent deep learning methods on 6 datasets and
improve the state-of-the-art performance by 5.56%-70%. Besides, we show that
NSP has better generalizability in predicting plausible trajectories in
drastically different scenarios where the density is 2-5 times as high as the
testing data. Finally, we show that the physics model in NSP can provide
plausible explanations for pedestrian behaviors, as opposed to black-box deep
learning. Code is available:
https://github.com/realcrane/Human-Trajectory-Prediction-via-Neural-Social-Physics
Comment: ECCV 202
Human Motion Anticipation and Recognition from RGB-D
Predicting and understanding the dynamics of human motion has many applications, such as motion synthesis, augmented reality, security, education, reinforcement learning, and autonomous vehicles. In this thesis, we create a novel end-to-end pipeline that can predict multiple future poses from the same input and also classify the entire sequence. Our focus is on the following two aspects of human motion understanding:
Probabilistic human action prediction: Given a sequence of human poses as input, we sample multiple possible future poses from the same input sequence using a new GAN-based network.
Human motion understanding: Given a sequence of human poses as input, we classify the actual action performed in the sequence and improve the classification performance using the representation learned from the prediction network.
We also demonstrate how to improve model training from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers to label each input image, and compare four different approaches: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches that fully leverage the label distribution. We shared the enhanced FER+ data set with multiple labels for each face image with the research community (https://github.com/Microsoft/FERPlus).
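To make the difference between these labeling schemes concrete, the following minimal sketch builds training targets from one image's vote counts. It covers majority voting, probabilistic label drawing, and a cross-entropy loss against the full label distribution; multi-label learning is left out. The vote counts, the four-class setup, and the model probabilities are hypothetical, and the code is not taken from the FER+ repository.

```python
import math
import random


def majority_vote(counts):
    """One-hot target on the class that received the most votes."""
    winner = max(range(len(counts)), key=counts.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(counts))]


def label_distribution(counts):
    """Normalize vote counts into a probability distribution (soft target)."""
    total = sum(counts)
    return [c / total for c in counts]


def draw_label(counts, rng):
    """Sample one hard label in proportion to the votes."""
    return rng.choices(range(len(counts)), weights=counts, k=1)[0]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


votes = [6, 3, 1, 0]              # hypothetical: 10 taggers, 4 classes
predicted = [0.5, 0.3, 0.1, 0.1]  # hypothetical model output

hard = majority_vote(votes)        # discards the 4 minority votes
soft = label_distribution(votes)   # keeps the full vote distribution
sampled = draw_label(votes, random.Random(0))

loss_hard = cross_entropy(hard, predicted)
loss_soft = cross_entropy(soft, predicted)
```

In this sketch, the majority-vote target ignores how the remaining taggers disagreed, while the soft target and the sampled label both reflect the vote distribution, which is the property the paragraph above credits for the better results.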
For predicting and understanding of human motion, we propose a novel sequence-to-sequence model trained with an improved version of generative adversarial networks (GAN). Our model, which we call HP-GAN2, learns a probability density function of future human poses conditioned on previous poses. It predicts multiple sequences of possible future human poses, each from the same input sequence but seeded with a different vector z drawn from a random distribution. Moreover, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment model that learns the probability that a given skeleton pose sequence is a real or fake human motion.
To classify the action performed in a video clip, we took two approaches. In the first approach, we train on a sequence of skeleton poses from scratch, using random parameter initialization and the same network architecture as the discriminator of the HP-GAN2 model. In the second approach, we take the discriminator of the HP-GAN2 network, extend it with an action classification branch, and fine-tune the end-to-end model on the classification tasks, since the discriminator in HP-GAN2 has learned to differentiate between fake and real human motion. Our hypothesis is that if the discriminator network can differentiate between synthetic and real skeleton poses, then it has also learned some of the dynamics of real human motion, and that those dynamics are useful in classification as well. We show through multiple experiments that this is indeed the case.
Therefore, our model learns to predict multiple future sequences of human poses from the same input sequence. We also show that the discriminator learns a general representation of human motion by using the learned features in an action recognition task. Finally, we train a motion-quality-assessment network that measures the probability that a given sequence of poses is valid human motion.
We test our model on two of the largest human pose datasets: NTURGB-D, and Human3.6M. We train on both single and multiple action types. The predictive power of our model for motion estimation is demonstrated by generating multiple plausible futures from the same input and showing the effect of each of the several loss functions in the ablation study. We also show the advantage of switching to GAN from WGAN-GP, which we used in our previous work. Furthermore, we show that it takes less than half the number of epochs to train an activity recognition network by using the features learned from the discriminator