UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons
Automatic co-speech gesture generation has drawn much attention in computer
animation. Previous works designed network structures for individual datasets,
which limits both data volume and generalizability across different
motion capture standards. In addition, the task is challenging due to the weak
correlation between speech and gestures. To address these problems, we present
UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis
approach, trained on multiple gesture datasets with different skeletons.
Specifically, we first present a retargeting network to learn latent
homeomorphic graphs for different motion capture standards, unifying the
representations of various gestures while extending the dataset. We then
capture the correlation between speech and gestures based on a diffusion model
architecture using cross-local attention and self-attention to generate better
speech-matched and realistic gestures. To further align speech and gesture and
increase diversity, we incorporate reinforcement learning on the discrete
gesture units with a learned reward function. Extensive experiments show that
UnifiedGesture outperforms recent approaches on speech-driven gesture
generation in terms of CCA, FGD, and human-likeness. All code, pre-trained
models, databases, and demos are available to the public at
https://github.com/YoungSeng/UnifiedGesture.
Comment: 16 pages, 11 figures, ACM MM 202
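As a rough illustration of the attention scheme described above, the following PyTorch sketch shows one way a denoising block could combine cross-local (windowed) attention between gesture latents and speech features with global self-attention. The module name, dimensions, and window size are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class LocalCrossThenSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=4, window=8):
        super().__init__()
        self.window = window
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, gesture, speech):
        # gesture, speech: (batch, frames, dim); frames assumed divisible by window
        b, t, d = gesture.shape
        w = self.window
        g = gesture.reshape(b * t // w, w, d)  # split gesture frames into local windows
        s = speech.reshape(b * t // w, w, d)   # align speech features window by window
        # cross-local attention: each gesture window attends to its speech window
        g = g + self.cross(g, s, s, need_weights=False)[0]
        g = self.norm1(g).reshape(b, t, d)
        # global self-attention over the whole gesture sequence
        g = g + self.self_attn(g, g, g, need_weights=False)[0]
        return self.norm2(g)

denoiser = LocalCrossThenSelfAttention()
noisy_gesture = torch.randn(2, 64, 256)  # (batch, frames, latent dim)
speech_feat = torch.randn(2, 64, 256)
print(denoiser(noisy_gesture, speech_feat).shape)  # torch.Size([2, 64, 256])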
Pose-to-Motion: Cross-Domain Motion Retargeting with Pose Prior
Creating believable motions for various characters has long been a goal in
computer graphics. Current learning-based motion synthesis methods depend on
extensive motion datasets, which are often challenging, if not impossible, to
obtain. On the other hand, pose data is more accessible, since static posed
characters are easier to create and can even be extracted from images using
recent advancements in computer vision. In this paper, we utilize this
alternative data source and introduce a neural motion synthesis approach
through retargeting. Our method generates plausible motions for characters that
have only pose data by transferring motion from an existing motion capture
dataset of another character, whose skeleton can differ drastically.
Our experiments show that our method effectively combines the motion features
of the source character with the pose features of the target character, and
performs robustly with small or noisy pose data sets, ranging from a few
artist-created poses to noisy poses estimated directly from images.
Additionally, a user study indicated that a majority of participants
found our retargeted motion more enjoyable to watch, more lifelike in
appearance, and less prone to artifacts. Project page:
https://cyanzhao42.github.io/pose2motion
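To make the described data flow concrete, here is a minimal PyTorch sketch of one plausible arrangement: a recurrent encoder for the source character's motion, a feed-forward encoder acting as a pose prior on the target character, and a decoder that fuses both to produce target-skeleton motion. All module names and layer sizes are illustrative assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

class PoseToMotion(nn.Module):
    def __init__(self, src_dof=69, tgt_dof=45, latent=128):
        super().__init__()
        self.motion_enc = nn.GRU(src_dof, latent, batch_first=True)   # source motion -> per-frame motion code
        self.pose_enc = nn.Sequential(nn.Linear(tgt_dof, latent), nn.ReLU(),
                                      nn.Linear(latent, latent))      # target pose prior
        self.decoder = nn.Sequential(nn.Linear(2 * latent, latent), nn.ReLU(),
                                     nn.Linear(latent, tgt_dof))      # fused code -> target motion frame

    def forward(self, src_motion, tgt_pose):
        # src_motion: (batch, frames, src_dof); tgt_pose: (batch, tgt_dof)
        motion_code, _ = self.motion_enc(src_motion)                  # (batch, frames, latent)
        pose_code = self.pose_enc(tgt_pose).unsqueeze(1)              # (batch, 1, latent)
        fused = torch.cat([motion_code, pose_code.expand_as(motion_code)], dim=-1)
        return self.decoder(fused)                                    # (batch, frames, tgt_dof)

model = PoseToMotion()
out = model(torch.randn(2, 60, 69), torch.randn(2, 45))
print(out.shape)  # torch.Size([2, 60, 45])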
LAC: Latent Action Composition for Skeleton-based Action Segmentation
Skeleton-based action segmentation requires recognizing composable actions in
untrimmed videos. Current approaches decouple this problem by first extracting
local visual features from skeleton sequences and then processing them by a
temporal model to classify frame-wise actions. However, their performance
remains limited because the visual features cannot sufficiently express composable
actions. In this context, we propose Latent Action Composition (LAC), a novel
self-supervised framework aiming at learning from synthesized composable
motions for skeleton-based action segmentation. LAC comprises a novel
generation module for synthesizing new sequences. Specifically, we design a
linear latent space in the generator to represent primitive motion. New
composed motions can be synthesized by simply performing arithmetic operations
on latent representations of multiple input skeleton sequences. LAC leverages
such synthesized sequences, which have large diversity and complexity, for
learning visual representations of skeletons in both sequence and frame spaces
via contrastive learning. The resulting visual encoder has a high expressive
power and can be effectively transferred onto action segmentation tasks by
end-to-end fine-tuning without the need for additional temporal models. We
conduct a transfer-learning study and show that representations learned from
pre-trained LAC outperform the state of the art by a large margin on the TSU,
Charades, and PKU-MMD datasets.
Comment: ICCV 202
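The latent-composition idea can be sketched in a few lines: encode two skeleton sequences into a linear latent space, combine the codes with a weighted sum, and decode the result as a new composed sequence. The encoder and decoder below are placeholder linear layers and the sizes are assumptions, not the authors' generation module.

import torch
import torch.nn as nn

class LatentComposer(nn.Module):
    def __init__(self, joints=25, coords=3, latent=256):
        super().__init__()
        dof = joints * coords
        self.encoder = nn.Linear(dof, latent)   # per-frame encoding into a linear latent space
        self.decoder = nn.Linear(latent, dof)

    def compose(self, seq_a, seq_b, alpha=0.5):
        # seq_a, seq_b: (frames, joints*coords); alpha weights the two primitives
        z = alpha * self.encoder(seq_a) + (1.0 - alpha) * self.encoder(seq_b)
        return self.decoder(z)

composer = LatentComposer()
walk = torch.randn(120, 75)                 # e.g. a "walking" clip
wave = torch.randn(120, 75)                 # e.g. a "waving" clip
combined = composer.compose(walk, wave)     # a synthetic "walk while waving" candidate
print(combined.shape)  # torch.Size([120, 75])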
Unsupervised human-to-robot motion retargeting via expressive latent space
This paper introduces a novel approach for human-to-robot motion retargeting,
enabling robots to mimic human motion with precision while preserving the
semantics of the motion. For that, we propose a deep learning method for direct
translation from human to robot motion. Our method does not require annotated
paired human-to-robot motion data, which reduces the effort when adopting new
robots. To this end, we first propose a cross-domain similarity metric to
compare the poses from different domains (i.e., human and robot). Then, our
method achieves the construction of a shared latent space via contrastive
learning and decodes latent representations to robot motion control commands.
The learned latent space exhibits expressiveness as it captures the motions
precisely and allows direct motion control in the latent space. We showcase how
to generate in-between motion through simple linear interpolation in the latent
space between two projected human poses. Additionally, we conducted a
comprehensive evaluation of robot control using diverse modality inputs, such
as text, RGB videos, and key poses, which makes robot control easier for
users of all backgrounds. Finally, we compare our model with existing works
and quantitatively and qualitatively demonstrate the effectiveness of our
approach, enhancing natural human-robot communication and fostering trust in
integrating robots into daily life.
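The in-between motion generation mentioned above reduces to straight-line interpolation in the shared latent space. The sketch below illustrates this with placeholder encoder and decoder networks; the dimensions and module definitions are assumptions rather than the paper's trained models.

import torch
import torch.nn as nn

human_encoder = nn.Sequential(nn.Linear(63, 128), nn.ReLU(), nn.Linear(128, 32))   # human pose -> latent code
robot_decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 27))   # latent code -> robot joint commands

def inbetween(pose_start, pose_end, steps=10):
    # project the two human poses into the shared latent space
    z0, z1 = human_encoder(pose_start), human_encoder(pose_end)
    ts = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # (steps, 1)
    z = (1.0 - ts) * z0 + ts * z1                       # straight line in latent space
    return robot_decoder(z)                             # (steps, robot_dof)

trajectory = inbetween(torch.randn(63), torch.randn(63))
print(trajectory.shape)  # torch.Size([10, 27])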
Zero-shot Pose Transfer for Unrigged Stylized 3D Characters
Transferring the pose of a reference avatar to stylized 3D characters of
various shapes is a fundamental task in computer graphics. Existing methods
either require the stylized characters to be rigged, or they use the stylized
character in the desired pose as ground truth at training. We present a
zero-shot approach that requires only the widely available deformed
non-stylized avatars in training, and deforms stylized characters of
significantly different shapes at inference. Classical methods achieve strong
generalization by deforming the mesh at the triangle level, but this requires
labelled correspondences. We leverage the power of local deformation, but
without requiring explicit correspondence labels. We introduce a
semi-supervised shape-understanding module to bypass the need for explicit
correspondences at test time, and an implicit pose deformation module that
deforms individual surface points to match the target pose. Furthermore, to
encourage realistic and accurate deformation of stylized characters, we
introduce an efficient volume-based test-time training procedure. Because it
does not need rigging, nor the deformed stylized character at training time,
our model generalizes to categories with scarce annotation, such as stylized
quadrupeds. Extensive experiments demonstrate the effectiveness of the proposed
method compared to the state-of-the-art approaches trained with comparable or
more supervision. Our project page is available at
https://jiashunwang.github.io/ZPT
Comment: CVPR 202
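A minimal sketch of an implicit, per-point deformation module of the kind described above: each surface point of the stylized character, together with a code describing the target pose, is mapped by an MLP to a residual displacement. Shapes and layer sizes are illustrative assumptions, not the paper's network.

import torch
import torch.nn as nn

class ImplicitPoseDeformer(nn.Module):
    def __init__(self, pose_dim=72, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points, pose_code):
        # points: (num_points, 3) surface samples; pose_code: (pose_dim,) target pose
        code = pose_code.unsqueeze(0).expand(points.shape[0], -1)
        offset = self.mlp(torch.cat([points, code], dim=-1))
        return points + offset   # predict a residual displacement per surface point

deformer = ImplicitPoseDeformer()
deformed = deformer(torch.rand(1024, 3), torch.randn(72))
print(deformed.shape)  # torch.Size([1024, 3])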
Self-Supervised Motion Retargeting with Safety Guarantee
In this paper, we present self-supervised shared latent embedding (S3LE), a
data-driven motion retargeting method that enables the generation of natural
motions in humanoid robots from motion capture data or RGB videos. While it
requires paired data consisting of human poses and their corresponding robot
configurations, it greatly reduces the need for time-consuming data collection
via novel paired-data generation processes. Our self-supervised
learning procedure consists of two steps: automatically generating paired data
to bootstrap the motion retargeting, and learning a projection-invariant
mapping to handle the different expressivity of humans and humanoid robots.
Furthermore, our method guarantees that the generated robot pose is
collision-free and satisfies position limits by utilizing nonparametric
regression in the shared latent space. We demonstrate that our method can
generate expressive robotic motions from both the CMU motion capture database
and YouTube videos.
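The safety mechanism can be illustrated with a simplified nonparametric regressor: a query latent code is mapped to a robot configuration by kernel-weighted averaging over a bank of stored, pre-verified anchor configurations, with an explicit clamp to position limits. This NumPy sketch is a loose simplification under assumed dimensions, not the S3LE implementation.

import numpy as np

rng = np.random.default_rng(0)
anchor_latents = rng.normal(size=(500, 16))                # latent codes of verified robot poses
anchor_configs = rng.uniform(-1.0, 1.0, size=(500, 27))    # matching safe robot configurations
joint_limits = (-1.5, 1.5)

def retarget(query_latent, bandwidth=2.0):
    # Nadaraya-Watson kernel regression over the bank of pre-verified anchors
    dists = np.linalg.norm(anchor_latents - query_latent, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))
    config = weights @ anchor_configs / weights.sum()
    # clamp enforces position limits; a real system would also re-check the
    # blended pose for collisions rather than relying on the anchors alone
    return np.clip(config, *joint_limits)

print(retarget(rng.normal(size=16)).shape)  # (27,)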
HumanMimic: Learning Natural Locomotion and Transitions for Humanoid Robot via Wasserstein Adversarial Imitation
Transferring human motion skills to humanoid robots remains a significant
challenge. In this study, we introduce a Wasserstein adversarial imitation
learning system, allowing humanoid robots to replicate natural whole-body
locomotion patterns and execute seamless transitions by mimicking human
motions. First, we present a unified primitive-skeleton motion retargeting to
mitigate morphological differences between arbitrary human demonstrators and
humanoid robots. An adversarial critic component is integrated with
Reinforcement Learning (RL) to guide the control policy to produce behaviors
aligned with the data distribution of mixed reference motions. Additionally, we
employ a specific Integral Probability Metric (IPM), namely the Wasserstein-1
distance with a novel soft boundary constraint to stabilize the training
process and prevent model collapse. Our system is evaluated on a full-sized
humanoid JAXON in the simulator. The resulting control policy demonstrates a
wide range of locomotion patterns, including standing, push-recovery, squat
walking, human-like straight-leg walking, and dynamic running. Notably, even in
the absence of transition motions in the demonstration dataset, robots showcase
an emerging ability to transition naturally between distinct locomotion
patterns as the desired speed changes.
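For intuition, the following PyTorch sketch shows the general shape of a Wasserstein-1 adversarial imitation critic with a soft penalty that discourages critic outputs from leaving a bounded range, standing in for the soft boundary constraint mentioned above. The feature dimension, penalty form, and weights are guesses, not the paper's exact objective.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(60, 256), nn.ReLU(), nn.Linear(256, 1))

def critic_loss(demo_feat, policy_feat, bound=1.0, penalty_weight=10.0):
    d_demo = critic(demo_feat)
    d_policy = critic(policy_feat)
    # Wasserstein-1 style objective: the critic widens the gap between
    # reference-motion scores and policy-rollout scores (minimizes this term)
    wasserstein = d_policy.mean() - d_demo.mean()
    # soft boundary: penalize critic outputs that stray outside [-bound, bound]
    overflow = torch.relu(d_demo.abs() - bound) + torch.relu(d_policy.abs() - bound)
    return wasserstein + penalty_weight * overflow.mean()

loss = critic_loss(torch.randn(128, 60), torch.randn(128, 60))
loss.backward()   # the RL policy would then receive a style reward derived from the critic
print(float(loss))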