97 research outputs found
LAC: Latent Action Composition for Skeleton-based Action Segmentation
Skeleton-based action segmentation requires recognizing composable actions in
untrimmed videos. Current approaches decouple this problem by first extracting
local visual features from skeleton sequences and then processing them by a
temporal model to classify frame-wise actions. However, their performances
remain limited as the visual features cannot sufficiently express composable
actions. In this context, we propose Latent Action Composition (LAC), a novel
self-supervised framework aiming at learning from synthesized composable
motions for skeleton-based action segmentation. LAC is composed of a novel
generation module towards synthesizing new sequences. Specifically, we design a
linear latent space in the generator to represent primitive motion. New
composed motions can be synthesized by simply performing arithmetic operations
on latent representations of multiple input skeleton sequences. LAC leverages
such synthesized sequences, which have large diversity and complexity, for
learning visual representations of skeletons in both sequence and frame spaces
via contrastive learning. The resulting visual encoder has a high expressive
power and can be effectively transferred onto action segmentation tasks by
end-to-end fine-tuning without the need for additional temporal models. We
conduct a study focusing on transfer-learning and we show that representations
learned from pre-trained LAC outperform the state-of-the-art by a large margin
on TSU, Charades, PKU-MMD datasets.Comment: ICCV 202
Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis
Synthesizing dynamic appearances of humans in motion plays a central role in applications such as ARWR and video editing. While many recent methods have been proposed to tackle this problem,handling loose garments with complex textures and high dynamic motion still remains challenging. In this paper,we propose a video based appearance synthesis method that tackles such challenges and demonstrates high quality results for in-the-wild videos that have not been shown before. Specifically,we adopt a StyleGAN based architecture to the task of person specific video based motion retargeting. We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes as well as regularizing the single frame based pose estimates to improve temporal coherency. We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively
Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting
In this paper, we introduce a neural rendering pipeline for transferring the
facial expressions, head pose, and body movements of one person in a source
video to another in a target video. We apply our method to the challenging case
of Sign Language videos: given a source video of a sign language user, we can
faithfully transfer the performed manual (e.g., handshape, palm orientation,
movement, location) and non-manual (e.g., eye gaze, facial expressions, mouth
patterns, head, and body movements) signs to a target video in a
photo-realistic manner. Our method can be used for Sign Language Anonymization,
Sign Language Production (synthesis module), as well as for reenacting other
types of full body activities (dancing, acting performance, exercising, etc.).
We conduct detailed qualitative and quantitative evaluations and comparisons,
which demonstrate the particularly promising and realistic results that we
obtain and the advantages of our method over existing approaches.Comment: Accepted at AI4CC Workshop at CVPR 202
MotionBERT: A Unified Perspective on Learning Human Motion Representations
We present a unified perspective on tackling various human-centric video
tasks by learning human motion representations from large-scale and
heterogeneous data resources. Specifically, we propose a pretraining stage in
which a motion encoder is trained to recover the underlying 3D motion from
noisy partial 2D observations. The motion representations acquired in this way
incorporate geometric, kinematic, and physical knowledge about human motion,
which can be easily transferred to multiple downstream tasks. We implement the
motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer)
neural network. It could capture long-range spatio-temporal relationships among
the skeletal joints comprehensively and adaptively, exemplified by the lowest
3D pose estimation error so far when trained from scratch. Furthermore, our
proposed framework achieves state-of-the-art performance on all three
downstream tasks by simply finetuning the pretrained motion encoder with a
simple regression head (1-2 layers), which demonstrates the versatility of the
learned motion representations. Code and models are available at
https://motionbert.github.io/Comment: ICCV 2023 Camera Read
Disentangled Representation Learning
Disentangled Representation Learning (DRL) aims to learn a model capable of
identifying and disentangling the underlying factors hidden in the observable
data in representation form. The process of separating underlying factors of
variation into variables with semantic meaning benefits in learning explainable
representations of data, which imitates the meaningful understanding process of
humans when observing an object or relation. As a general learning strategy,
DRL has demonstrated its power in improving the model explainability,
controlability, robustness, as well as generalization capacity in a wide range
of scenarios such as computer vision, natural language processing, data mining
etc. In this article, we comprehensively review DRL from various aspects
including motivations, definitions, methodologies, evaluations, applications
and model designs. We discuss works on DRL based on two well-recognized
definitions, i.e., Intuitive Definition and Group Theory Definition. We further
categorize the methodologies for DRL into four groups, i.e., Traditional
Statistical Approaches, Variational Auto-encoder Based Approaches, Generative
Adversarial Networks Based Approaches, Hierarchical Approaches and Other
Approaches. We also analyze principles to design different DRL models that may
benefit different tasks in practical applications. Finally, we point out
challenges in DRL as well as potential research directions deserving future
investigations. We believe this work may provide insights for promoting the DRL
research in the community.Comment: 22 pages,9 figure
- …