Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose Reconstruction in a Diffusion Framework
Monocular 3D human pose estimation poses significant challenges due to the
inherent depth ambiguities that arise during the reprojection process from 2D
to 3D. Conventional approaches that rely on estimating an overfitted projection
matrix struggle to effectively address these challenges and often result in
noisy outputs. Recent advancements in diffusion models have shown promise in
incorporating structural priors to address reprojection ambiguities. However,
there is still ample room for improvement as these methods often overlook the
exploration of correlation between the 2D and 3D joint-level features. In this
study, we propose a novel cross-channel embedding framework that aims to fully
explore the correlation between joint-level features of 3D coordinates and
their 2D projections. In addition, we introduce a context guidance mechanism to
facilitate the propagation of joint graph attention across latent channels
during the iterative diffusion process. To evaluate the effectiveness of our
proposed method, we conduct experiments on two benchmark datasets, namely
Human3.6M and MPI-INF-3DHP. Our results demonstrate a significant improvement
in terms of reconstruction accuracy compared to state-of-the-art methods. The
code for our method will be made available online for further reference.
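As a rough illustration of what exploring cross-channel correlation between 2D and 3D joint-level features might look like, here is a toy softmax-weighted channel mixing. All names and shapes are assumptions for the sketch, not the paper's implementation:

```python
import math

def cross_channel_mixing(feat2d, feat3d):
    """Toy cross-channel correlation between 2D and 3D joint-level
    features. Each input is a list of latent channels, where a channel
    is a vector with one value per joint. For every 3D channel we
    compute its dot-product similarity to each 2D channel, softmax the
    similarities, and return the similarity-weighted mix of 2D channels."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    mixed = []
    for c3 in feat3d:
        sims = [dot(c3, c2) for c2 in feat2d]
        m = max(sims)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in sims]
        z = sum(w)
        w = [x / z for x in w]
        n_joints = len(feat2d[0])
        mixed.append([sum(wi * c2[j] for wi, c2 in zip(w, feat2d))
                      for j in range(n_joints)])
    return mixed
```

In the actual method this correlation would be learned and propagated through the iterative diffusion steps; the sketch only shows the channel-wise similarity idea.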
A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion
Multi-person motion capture can be challenging due to ambiguities caused by
severe occlusion, fast body movement, and complex interactions. Existing
frameworks build on 2D pose estimations and triangulate to 3D coordinates via
reasoning the appearance, trajectory, and geometric consistencies among
multi-camera observations. However, 2D joint detections are usually incomplete
and carry wrong identity assignments due to limited observation angles, which
leads to noisy 3D triangulation results. To overcome this issue, we propose to
explore the short-range autoregressive characteristics of skeletal motion using
a transformer. First, we propose an adaptive, identity-aware triangulation module
to reconstruct 3D joints and identify the missing joints for each identity. To
generate complete 3D skeletal motion, we then propose a Dual-Masked
Auto-Encoder (D-MAE) which encodes the joint status with both
skeletal-structural and temporal position encoding for trajectory completion.
D-MAE's flexible masking and encoding mechanism enables arbitrary skeleton
definitions to be conveniently deployed under the same framework. In order to
demonstrate the proposed model's capability in dealing with severe data loss
scenarios, we contribute a high-accuracy and challenging motion capture dataset
of multi-person interactions with severe occlusion. Evaluations on both
benchmark and our new dataset demonstrate the efficiency of our proposed model,
as well as its advantage over other state-of-the-art methods.
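The dual positional encoding D-MAE describes (skeletal-structural plus temporal) can be sketched as the sum of two sinusoidal encodings, one indexed by joint and one by frame. This is an illustrative reading of the abstract, not the paper's code:

```python
import math

def sinusoidal(pos, dim):
    """Standard sinusoidal positional encoding for a single position."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
            for i in range(dim)]

def dual_position_encoding(joint_id, frame_id, dim):
    """Encode a skeletal token by summing a skeletal-structural
    encoding (which joint it is) and a temporal encoding (which frame
    it belongs to), so a masked token's place in both the skeleton
    and the trajectory is recoverable during completion."""
    structural = sinusoidal(joint_id, dim)
    temporal = sinusoidal(frame_id, dim)
    return [s + t for s, t in zip(structural, temporal)]
```

Because the encoding depends only on (joint_id, frame_id), any skeleton definition maps onto the same token scheme, which is the flexibility the abstract claims.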
Towards Direct Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is widely useful in many cross-lingual communication scenarios, including multinational conferences and international travel. Text-based simultaneous machine translation (SimulMT) has achieved great success in recent years. The conventional cascaded approach for SimulST uses a pipeline of streaming ASR followed by simultaneous MT, but it suffers from error propagation and extra latency. Recent efforts attempt to translate the source speech directly into the target text or speech simultaneously, but this is much harder due to the combination of separate tasks. In this dissertation, we focus on improving the simultaneous translation model, enabling it to handle speech input and directly generate translated text in the target language. First, we investigate how to improve simultaneous translation by incorporating more monotonic generated pseudo references in training. These pseudo references, which contain fewer reorderings, cause less anticipation and can substantially improve simultaneous translation quality. Then, we propose an ASR-assisted direct SimulST framework. The model can translate directly from the given speech with a wait-k policy guided by a synchronized streaming ASR. However, speech translation tasks suffer from data scarcity. To alleviate this issue, we next introduce a Fused Acoustic and Text Masked Language Model (FAT-MLM), which jointly learns a unified representation for both acoustic and text input from various types of corpora, including parallel data for speech recognition and machine translation, and even pure speech and text data. By finetuning from FAT-MLM, the speech translation model can be substantially improved. Beyond that, we further extend FAT-MLM to cross-lingual speech synthesis. Our proposed model can clone the voice of the source speaker and generate the corresponding speech in the target language.
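The wait-k policy mentioned above follows a fixed read/write schedule: read k source tokens before the first write, then alternate one write per read until the source is exhausted. A minimal token-level simulation of that schedule (illustrative only; it abstracts away the actual speech segmentation and the ASR guidance):

```python
def wait_k_actions(k, src_len, tgt_len):
    """Return the READ/WRITE action sequence of a wait-k policy:
    read k source tokens before the first write, then alternate,
    and finish writing once the source is exhausted."""
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        if read < src_len and read < written + k:
            actions.append("READ")  # still fewer than k tokens ahead
            read += 1
        else:
            actions.append("WRITE")  # k tokens ahead (or source done)
            written += 1
    return actions
```

For example, `wait_k_actions(2, 4, 4)` reads two tokens, then alternates writes and reads, then flushes the remaining target tokens; larger k trades latency for more context per write.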
ChoreoGraph: Music-conditioned Automatic Dance Choreography over a Style and Tempo Consistent Dynamic Graph
Generating dance that temporally and aesthetically matches the music is a
challenging problem, as several factors need to be considered. First, the
aesthetic styles and messages conveyed by the motion and music should be
consistent. Second, the beats of the generated motion should be locally aligned
to the musical features. And finally, basic choreomusical rules should be
observed, and the motion generated should be diverse. To address these
challenges, we propose ChoreoGraph, which choreographs high-quality dance
motion for a given piece of music over a Dynamic Graph. A data-driven learning
strategy is proposed to evaluate the aesthetic style and rhythmic connections
between music and motion in a progressively learned cross-modality embedding
space. The motion sequences will be beat-aligned based on the music segments
and then incorporated as nodes of a Dynamic Motion Graph. Compatibility factors
such as the style and tempo consistency, motion context connection, action
completeness, and transition smoothness are comprehensively evaluated to
determine the node transition in the graph. We demonstrate that our
repertoire-based framework can generate motions that are aesthetically
consistent and robustly extensible in diversity. Both quantitative and
qualitative experimental results show that our proposed model outperforms
other baseline models.
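The node-transition step over the Dynamic Motion Graph, as described, amounts to scoring candidate edges by a weighted combination of compatibility factors and taking the best one. A minimal sketch, where the factor names follow the abstract but the weights and data layout are illustrative assumptions:

```python
def transition_score(factors, weights):
    """Weighted sum of compatibility factors for one candidate edge.
    Factor keys (style, tempo, context, completeness, smoothness)
    mirror the abstract; the weights here are made up for the demo."""
    return sum(weights[name] * factors[name] for name in weights)

def next_node(candidates, weights):
    """Pick the candidate motion node whose edge from the current
    node maximizes the combined compatibility score."""
    return max(candidates, key=lambda c: transition_score(c["factors"], weights))
```

In the real framework these factors would be evaluated by the learned cross-modality embedding rather than hand-set numbers; the sketch only shows the greedy graph-walk structure.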