Multiscale Residual Learning of Graph Convolutional Sequence Chunks for Human Motion Prediction
A new method is proposed for human motion prediction by learning temporal and
spatial dependencies. Recently, multiscale graphs have been developed to model
the human body at higher abstraction levels, resulting in more stable motion
prediction. Current methods, however, predetermine the scale levels and combine
spatially proximal joints to generate coarser scales based on human priors,
even though movement patterns vary across motion sequences and do not
fully comply with a fixed graph of spatially connected joints. Another problem
with graph convolutional methods is mode collapse, in which predicted poses
converge around a mean pose with no discernible movements, particularly in
long-term predictions. To tackle these issues, we propose ResChunk, an
end-to-end network which explores dynamically correlated body components based
on the pairwise relationships between all joints in individual sequences.
ResChunk is trained to learn the residuals between target sequence chunks in an
autoregressive manner to enforce temporal connectivity between
consecutive chunks. It is hence a sequence-to-sequence prediction network that
considers dynamic spatio-temporal features of sequences at multiple levels. Our
experiments on two challenging benchmark datasets, CMU Mocap and Human3.6M,
demonstrate that our proposed method effectively models the sequence
information for motion prediction and outperforms other techniques, setting a
new state of the art. Our code is available at
https://github.com/MohsenZand/ResChunk.
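
As a rough illustration of the chunk-wise residual idea described above (a
minimal sketch, not the authors' released code; all names and shapes are
hypothetical, and a simple MLP stands in for the graph-convolutional backbone):

    import torch
    import torch.nn as nn

    class ChunkResidualPredictor(nn.Module):
        """Predicts each future chunk as the previous chunk plus a learned residual."""
        def __init__(self, num_joints=25, dims=3, chunk_len=10, hidden=256):
            super().__init__()
            feat = num_joints * dims * chunk_len  # one flattened chunk
            # Stand-in for the paper's graph-convolutional backbone.
            self.residual_net = nn.Sequential(
                nn.Linear(feat, hidden), nn.ReLU(), nn.Linear(hidden, feat))

        def forward(self, obs_chunks, num_future_chunks):
            # obs_chunks: (batch, num_chunks, chunk_len * num_joints * dims)
            chunk = obs_chunks[:, -1]                     # last observed chunk
            preds = []
            for _ in range(num_future_chunks):            # autoregressive roll-out
                chunk = chunk + self.residual_net(chunk)  # next = previous + residual
                preds.append(chunk)
            return torch.stack(preds, dim=1)

    model = ChunkResidualPredictor()
    obs = torch.randn(4, 5, 25 * 3 * 10)      # 4 sequences, 5 observed chunks
    future = model(obs, num_future_chunks=3)  # (4, 3, 750)

Learning the residual between consecutive chunks, rather than absolute poses,
keeps the roll-out temporally connected and discourages collapse to a static
mean pose.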
Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting
Multi-person pose forecasting remains a challenging problem, especially in
modeling fine-grained human body interaction in complex crowd scenarios.
Existing methods typically represent the whole pose sequence as a temporal
series, yet overlook interactive influences among people based on skeletal body
parts. In this paper, we propose a novel Trajectory-Aware Body Interaction
Transformer (TBIFormer) for multi-person pose forecasting via effectively
modeling body part interactions. Specifically, we construct a Temporal Body
Partition Module that transforms all the pose sequences into a Multi-Person
Body-Part sequence to retain spatial and temporal information based on body
semantics. Then, we devise a Social Body Interaction Self-Attention (SBI-MSA)
module, utilizing the transformed sequence to learn body part dynamics for
inter- and intra-individual interactions. Furthermore, different from prior
Euclidean distance-based spatial encodings, we present a novel and efficient
Trajectory-Aware Relative Position Encoding for SBI-MSA to offer discriminative
spatial information and additional interactive clues. On both short- and
long-term horizons, we empirically evaluate our framework on CMU-Mocap,
MuPoTS-3D, and synthesized datasets (6 to 10 persons), and demonstrate
that our method greatly outperforms the state-of-the-art methods. Code will be
made publicly available upon acceptance.
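
A minimal sketch of the body-part token idea (my reading of the abstract, not
the released TBIFormer code; the partitioning and all names are assumptions):
every (person, frame, body part) triple becomes one token, so a single
self-attention pass covers inter- and intra-individual part interactions.

    import torch
    import torch.nn as nn

    P, T, PARTS, FEAT = 3, 15, 5, 64  # persons, frames, body parts, token dim

    class BodyPartInteraction(nn.Module):
        def __init__(self, feat=FEAT, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(feat, heads, batch_first=True)

        def forward(self, part_tokens):
            # part_tokens: (batch, P * T * PARTS, feat), one token per
            # (person, frame, body part) from the partitioned pose sequence.
            out, _ = self.attn(part_tokens, part_tokens, part_tokens)
            return out

    tokens = torch.randn(2, P * T * PARTS, FEAT)
    out = BodyPartInteraction()(tokens)  # (2, 225, 64)

The paper's trajectory-aware relative position encoding would additionally bias
these attention scores with pairwise trajectory cues; plain dot-product
attention is used here for brevity.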
EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning
Learning to predict agent motions with relationship reasoning is important
for many applications. In motion prediction tasks, maintaining motion
equivariance under Euclidean geometric transformations and invariance of agent
interaction is a critical and fundamental principle. However, such equivariance
and invariance properties are overlooked by most existing methods. To fill this
gap, we propose EqMotion, an efficient equivariant motion prediction model with
invariant interaction reasoning. To achieve motion equivariance, we propose an
equivariant geometric feature learning module to learn a Euclidean
transformable feature through dedicated designs of equivariant operations. To
reason about agents' interactions, we propose an invariant interaction reasoning
module that achieves more stable interaction modeling. To further promote more
comprehensive motion features, we propose an invariant pattern feature learning
module to learn an invariant pattern feature, which cooperates with the
equivariant geometric feature to enhance network expressiveness. We conduct
experiments for the proposed model on four distinct scenarios: particle
dynamics, molecule dynamics, human skeleton motion prediction and pedestrian
trajectory prediction. Experimental results show that our method is not only
generally applicable but also achieves state-of-the-art prediction
performance on all four tasks, with respective improvements of 24.0%, 30.1%,
8.6%, and 9.2%. Code is
available at https://github.com/MediaBrain-SJTU/EqMotion.
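
To make the equivariance/invariance principle concrete, here is a minimal,
self-contained sketch (an illustration of the general idea, not EqMotion's
actual modules): mixing 3D-vector channels with a learned matrix commutes with
any orthogonal transform, while vector norms are invariant to it.

    import torch
    import torch.nn as nn

    class EquivariantLinear(nn.Module):
        def __init__(self, c_in, c_out):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(c_out, c_in) / c_in ** 0.5)

        def forward(self, x):
            # x: (batch, c_in, 3) -- each channel is a 3D vector.
            # Mixing channels (never coordinates) preserves equivariance:
            # f(x @ Q) == f(x) @ Q for any orthogonal Q.
            return torch.einsum('oi,bid->bod', self.weight, x)

    def invariant_features(x):
        return x.norm(dim=-1)  # per-channel norms are rotation-invariant

    layer = EquivariantLinear(8, 16)
    x = torch.randn(2, 8, 3)
    Q, _ = torch.linalg.qr(torch.randn(3, 3))  # a random orthogonal matrix
    assert torch.allclose(layer(x @ Q), layer(x) @ Q, atol=1e-5)
    assert torch.allclose(invariant_features(x @ Q), invariant_features(x), atol=1e-5)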
Concurrence-Aware Long Short-Term Sub-Memories for Person-Person Action Recognition
Recently, Long Short-Term Memory (LSTM) has become a popular choice to model
individual dynamics for single-person action recognition due to its ability of
modeling the temporal information in various ranges of dynamic contexts.
However, existing RNN models focus only on capturing the temporal dynamics of
person-person interactions by naively combining the activity dynamics of
individuals or modeling them as a whole. This neglects the inter-related
dynamics of how person-person interactions change over time. To this end, we
propose a novel Concurrence-Aware Long Short-Term Sub-Memories (Co-LSTSM) to
model the long-term inter-related dynamics between two interacting people on
the bounding boxes covering people. Specifically, for each frame, two
sub-memory units store individual motion information, while a concurrent LSTM
unit selectively integrates and stores inter-related motion information between
interacting people from these two sub-memory units via a new co-memory cell.
Experimental results on the BIT and UT datasets show the superiority of
Co-LSTSM compared with the state-of-the-art methods.
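
A minimal sketch of the co-memory idea (a structure inferred from the abstract,
not the authors' implementation; all names are hypothetical): two per-person
LSTM sub-memories track individual motion, and learned gates select what each
contributes to a shared inter-person memory.

    import torch
    import torch.nn as nn

    class CoMemoryCell(nn.Module):
        def __init__(self, feat, hidden):
            super().__init__()
            self.sub_a = nn.LSTMCell(feat, hidden)  # sub-memory for person A
            self.sub_b = nn.LSTMCell(feat, hidden)  # sub-memory for person B
            self.gate = nn.Linear(2 * hidden, 2 * hidden)

        def forward(self, xa, xb, state_a, state_b):
            ha, ca = self.sub_a(xa, state_a)
            hb, cb = self.sub_b(xb, state_b)
            # Gates select how much of each sub-memory enters the co-memory.
            ga, gb = torch.sigmoid(self.gate(torch.cat([ha, hb], -1))).chunk(2, -1)
            co_memory = ga * ca + gb * cb
            return co_memory, (ha, ca), (hb, cb)

    cell = CoMemoryCell(feat=32, hidden=64)
    xa, xb = torch.randn(8, 32), torch.randn(8, 32)
    zeros = (torch.zeros(8, 64), torch.zeros(8, 64))
    co, state_a, state_b = cell(xa, xb, zeros, zeros)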
MSA-GCN: Multiscale Adaptive Graph Convolution Network for Gait Emotion Recognition
Gait emotion recognition plays a crucial role in intelligent systems. Most
of the existing methods recognize emotions by focusing on local actions over
time. However, they ignore that different emotions span different effective
durations in the time domain and that local actions during walking are
quite similar. Thus, emotions should be represented by global states instead of
indirect local actions. To address these issues, a novel Multiscale Adaptive
Graph Convolution Network (MSA-GCN) is presented in this work through
constructing dynamic temporal receptive fields and designing multiscale
information aggregation to recognize emotions. In our model, an adaptive
selective spatio-temporal graph convolution is designed to select the
convolution kernel dynamically to obtain the soft spatio-temporal features of
different emotions. Moreover, a Cross-Scale Mapping Fusion Mechanism (CSFM) is
designed to construct an adaptive adjacency matrix to enhance information
interaction and reduce redundancy. Compared with previous state-of-the-art
methods, the proposed method achieves the best performance on two public
datasets, improving mAP by 2%. We also conduct extensive ablation studies
to show the effectiveness of the different components in our method.
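
As a hedged sketch of the adaptive kernel-selection idea (my reading of the
abstract; all names and shapes are invented, and this is not the authors'
code): parallel temporal convolutions with different kernel sizes provide
different receptive fields, and softmax weights predicted from the input itself
choose among them per sample.

    import torch
    import torch.nn as nn

    class AdaptiveTemporalConv(nn.Module):
        def __init__(self, channels, kernel_sizes=(3, 5, 7)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
                for k in kernel_sizes)
            self.select = nn.Linear(channels, len(kernel_sizes))

        def forward(self, x):
            # x: (batch, channels, frames, joints)
            feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, T, J)
            context = x.mean(dim=(2, 3))                    # global pooled descriptor
            w = torch.softmax(self.select(context), dim=1)  # per-sample branch weights
            return (w[:, :, None, None, None] * feats).sum(dim=1)

    layer = AdaptiveTemporalConv(channels=16)
    x = torch.randn(2, 16, 30, 25)  # 30 frames, 25 joints
    y = layer(x)                    # same shape, input-dependent receptive field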