Diverse Human Motion Prediction via Gumbel-Softmax Sampling from an Auxiliary Space
Diverse human motion prediction aims at predicting multiple possible future
pose sequences from a sequence of observed poses. Previous approaches usually
employ deep generative networks to model the conditional distribution of data,
and then randomly sample outcomes from the distribution. While different
results can be obtained, they are usually the most likely ones which are not
diverse enough. Recent work explicitly learns multiple modes of the conditional
distribution via a deterministic network, which however can only cover a fixed
number of modes within a limited range. In this paper, we propose a novel
sampling strategy for sampling very diverse results from an imbalanced
multimodal distribution learned by a deep generative model. Our method works by
generating an auxiliary space and making random sampling from the
auxiliary space equivalent to diverse sampling from the target
distribution. We propose a simple yet effective network architecture that
implements this novel sampling strategy, which incorporates a Gumbel-Softmax
coefficient matrix sampling method and an aggressive diversity promoting hinge
loss function. Extensive experiments demonstrate that our method significantly
improves both the diversity and accuracy of the samples compared with
previous state-of-the-art sampling approaches. Code and pre-trained models are
available at https://github.com/Droliven/diverse_sampling.
Comment: Paper and Supp of our work accepted by ACM MM 202
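As a rough illustration of the two ingredients named in the abstract above, the following sketch draws a relaxed one-hot sample with the Gumbel-Softmax trick and computes a generic pairwise hinge penalty that drops to zero once samples are at least a margin apart. It is not the authors' implementation: the function names, the `temperature` and `margin` parameters, and the pairwise Euclidean form of the loss are assumptions made for illustration only, and the linked repository is the place to look for the actual architecture.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample from the categorical defined by logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        # U is clamped away from 0 so the inner log stays finite.
        u = max(rng.random(), 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_hinge_loss(samples, margin=1.0):
    """Mean over all sample pairs of max(0, margin - Euclidean distance).

    Hypothetical loss form: pairs closer than `margin` are penalized,
    pairs farther apart contribute nothing.
    """
    losses = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            dist = math.dist(samples[i], samples[j])
            losses.append(max(0.0, margin - dist))
    return sum(losses) / len(losses) if losses else 0.0
```

Lowering `temperature` pushes each draw closer to a one-hot vector, while a larger `margin` keeps the hinge penalty active for sample pairs that are farther apart.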
Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation
Deriving sophisticated 3D motions from sparse keyframes is a particularly
challenging problem because it demands continuity and exceptional skeletal precision.
The action features are often derivable accurately from the full series of
keyframes, and thus, leveraging the global context with transformers has been a
promising data-driven embedding approach. However, existing methods often
take as input intermediate frames interpolated from the keyframes with basic
interpolation methods to ensure continuity, which results in a trivial local minimum
during training. In this paper, we propose a novel framework to formulate
latent motion manifolds with keyframe-based constraints, in which the
continuous nature of intermediate token representations is taken into account.
Particularly, our proposed framework consists of two stages for identifying a
latent motion subspace, i.e., a keyframe encoding stage and an intermediate
token generation stage, and a subsequent motion synthesis stage to extrapolate
and compose motion data from manifolds. Through our extensive experiments
conducted on both the LaFAN1 and CMU Mocap datasets, our proposed method
demonstrates both superior interpolation accuracy and high visual similarity to
ground truth motions.
Comment: Accepted by CVPR 202
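For context, the "basic interpolation" input that the abstract above criticizes can be pictured as plain linear interpolation between neighbouring keyframes. The sketch below shows only that baseline, not the proposed framework; the function name, the dictionary layout of `keyframes`, and the representation of a pose as a flat list of floats are assumptions made for illustration.

```python
def lerp_keyframes(keyframes, num_frames):
    """Fill frames 0..num_frames-1 by linear interpolation between keyframes.

    keyframes: dict mapping frame index -> pose (list of floats).
    The first and last frames must both be keyframes (assumed here).
    """
    idx = sorted(keyframes)
    if not idx or idx[0] != 0 or idx[-1] != num_frames - 1:
        raise ValueError("first and last frames must be keyframes")
    frames = []
    for a, b in zip(idx, idx[1:]):
        pose_a, pose_b = keyframes[a], keyframes[b]
        for t in range(a, b):
            # Blend weight runs from 0 at keyframe a towards 1 at keyframe b.
            alpha = (t - a) / (b - a)
            frames.append([x + alpha * (y - x) for x, y in zip(pose_a, pose_b)])
    # The final keyframe is copied through unchanged.
    frames.append(list(keyframes[idx[-1]]))
    return frames
```

For example, `lerp_keyframes({0: [0.0, 0.0], 4: [4.0, 8.0]}, 5)` returns five poses whose middle entry is `[2.0, 4.0]`.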
Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition
Micro-expressions are spontaneous, rapid and subtle facial movements that can
neither be forged nor suppressed. They are very important nonverbal
communication cues, but are transient and of low intensity, and thus difficult to
recognize. Recently, deep-learning-based methods have been developed for
micro-expression (ME) recognition using feature extraction and fusion
techniques. However, targeted feature learning and efficient feature fusion
tailored to ME characteristics still lack sufficient study. To address these
issues, we propose a novel framework, Feature Representation Learning with
adaptive Displacement Generation and Transformer fusion (FRL-DGT), in which a
convolutional Displacement Generation Module (DGM) with self-supervised
learning is used to extract dynamic features from onset/apex frames targeted to
the subsequent ME recognition task, and a well-designed Transformer Fusion
mechanism composed of three Transformer-based fusion modules (local, global
fusions based on AU regions and full-face fusion) is applied to extract the
multi-level informative features after DGM for the final ME prediction.
Extensive experiments with leave-one-subject-out (LOSO) evaluation
demonstrate the superiority of our proposed FRL-DGT over
state-of-the-art methods.
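The leave-one-subject-out (LOSO) protocol mentioned in the abstract above holds out all samples of one subject per fold and trains on the rest. A minimal sketch of the split is given below; the function name and the use of a flat list of subject identifiers are assumptions for illustration, and this is not taken from the paper's evaluation code.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for subject in sorted(set(subject_ids)):
        # Every sample from the held-out subject goes to the test fold.
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test
```

With subject labels `["s1", "s2", "s1", "s3"]` this produces three folds, the first of which tests on samples 0 and 2 and trains on samples 1 and 3.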